This project analyzes Missing At Random (MAR) data patterns using a simulated dataset. Several statistical methods are applied to understand and handle missing data scenarios effectively. The experiment relies on a set of metrics that make the results of the different approaches comparable.
5 Minutes
Abstract and study objective. Description of project structure.
10 Minutes
Missing value generation mechanisms. Exploration via plots of different mechanisms.
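
As an illustration of one such mechanism, the following is a minimal sketch of MAR missingness, where the probability that `y` is missing depends only on a fully observed covariate `x`. The helper `make_mar` and its `intercept` and `slope` parameters are hypothetical names chosen for this example; they are not the functions defined in `src/missing_data.R`.

```r
# Minimal MAR sketch: missingness in y depends only on the observed x.
set.seed(42)  # illustrative seed, not the project's

n <- 500
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Hypothetical helper (assumption): logistic model for the missingness probability.
make_mar <- function(target, driver, intercept = -1, slope = 1.5) {
  p_missing <- plogis(intercept + slope * driver)   # P(missing | driver)
  is_missing <- runif(length(target)) < p_missing   # Bernoulli draw per row
  target[is_missing] <- NA
  target
}

y_mar <- make_mar(y, x)

mean(is.na(y_mar))                 # overall missing rate
tapply(is.na(y_mar), x > 0, mean)  # missing rate split by the sign of x
```

Replacing `driver` with random noise would give an MCAR pattern, and replacing it with `target` itself would give an MNAR pattern, which is one way to compare the mechanisms in plots.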
Imputation strategies:
- Added noise to imputation
- Linear with heteroscedasticity
- Polynomial imbalanced
- Non-polynomial (piecewise)
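
A possible reading of the first item in the list above is stochastic regression imputation: predict the missing values with a regression model, then add residual-scale noise so the imputed values do not all sit on the fitted line. The sketch below assumes that reading; `impute_with_noise` is a hypothetical helper, not the implementation in `src/imputation_methods.R`, and the simulated data are illustrative only.

```r
# Sketch of regression imputation with added noise (assumed interpretation).
set.seed(1)  # illustrative seed

n <- 300
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 1)

y_obs <- y
y_obs[sample(n, 90)] <- NA  # illustrative missingness pattern

# Hypothetical helper (assumption): linear model plus Gaussian residual noise.
impute_with_noise <- function(target, predictor) {
  dat <- data.frame(target = target, predictor = predictor)
  fit <- lm(target ~ predictor, data = dat)  # rows with NA are dropped by default
  miss <- is.na(target)
  pred <- predict(fit, newdata = dat[miss, , drop = FALSE])
  # Add noise with the residual standard deviation to preserve the spread
  target[miss] <- pred + rnorm(sum(miss), sd = sigma(fit))
  target
}

y_imp <- impute_with_noise(y_obs, x)
```

Without the noise term the imputed values would have less variance than the observed ones, which tends to distort the distributional metrics used later.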
Explanation of the metrics used to measure differences between distributions, and why we chose them. To measure the divergence between the original and imputed datasets, two key metrics are used: the Wasserstein distance, which quantifies distributional differences, and the square root of the Jensen-Shannon divergence, which measures the similarity between probability distributions.
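
Both metrics can be sketched in base R for a single numeric variable. The functions below are hypothetical illustrations, not the code in `src/metrics.R`: `wasserstein_1d` assumes two samples of equal size, and `js_distance` assumes a shared histogram binning (the `n_bins` default is arbitrary) with base-2 logarithms so the result lies between 0 and 1.

```r
# 1-Wasserstein distance between two equal-size samples:
# the mean absolute difference between their sorted values.
wasserstein_1d <- function(a, b) {
  stopifnot(length(a) == length(b))
  mean(abs(sort(a) - sort(b)))
}

# Square root of the Jensen-Shannon divergence, estimated on shared bins.
js_distance <- function(a, b, n_bins = 30) {
  breaks <- seq(min(a, b), max(a, b), length.out = n_bins + 1)
  p <- hist(a, breaks = breaks, plot = FALSE)$counts / length(a)
  q <- hist(b, breaks = breaks, plot = FALSE)$counts / length(b)
  m <- (p + q) / 2
  # Kullback-Leibler divergence, treating 0 * log(0) as 0
  kl <- function(u, v) sum(ifelse(u > 0, u * log2(u / v), 0))
  jsd <- 0.5 * kl(p, m) + 0.5 * kl(q, m)
  sqrt(max(0, jsd))  # guard against tiny negative rounding errors
}

set.seed(7)  # illustrative seed
original <- rnorm(1000)
imputed <- rnorm(1000, mean = 0.3)  # stand-in for an imputed variable

wasserstein_1d(original, imputed)
js_distance(original, imputed)
```

The Wasserstein distance is expressed in the units of the variable, while the Jensen-Shannon distance is bounded and depends on the chosen binning, so the two give complementary views of the same comparison.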
Visualization and comparison of imputation strategies
10 Minutes
This part is a case study on a real dataset with missing data. In this section we apply the techniques studied previously, taking into account what we learned.
- Dataset description
  - Highlight missing value mechanisms
  - Exploratory data analysis
  - Train-test split
- Dataset imputation
- Model fitting
- Results comparison
.
├── notebooks
│ ├── partial_analyses
│ │ ├── part_1.Rmd
│ │ └── part_2.Rmd
│ └── final_results.rmd
├── src
│ ├── imputation_methods.R
│ ├── metrics.R
│ ├── missing_data.R
│ ├── plots.R
│ ├── setup.R
│ ├── synthetic_data.R
│ └── utils.R
└── README.md
- README.md: Project documentation
- final_results.html: Project HTML file containing everything we will present
- part_1.Rmd: Analyzing missing data patterns + imputation on synthetic data
- part_2.Rmd: This will contain a case study on a real dataset
- final_results.Rmd: Comprehensive results and conclusions (the file which puts everything together). Run knit on this file to obtain the final report
- synthetic_data.R: Functions to generate a synthetic dataset
- imputation_methods.R: Functions to implement different imputation techniques
- missing_data.R: Functions to artificially generate missing data
- metrics.R: Functions to evaluate different strategies to handle missing data
- utils.R: Functions that are general utilities
- plots.R: Functions to make plots
- setup.R: All libraries + setting seed (imported for each notebook)
To add new version ...
Everything that might be useful even in the future
- Outliers and missing values
- Various imputation techniques in detail
- Generating Synthetic Missing Data: A Review by Missing Mechanism
- Imputation techniques: an overview
- Jacopo Zacchigna, Devid Rosa, Ludovica Bianchi, Cristiano Baldassi
This project is part of the Statistical Methods Examination.