Dealing with missing data

Project Overview

This project analyzes Missing At Random (MAR) data patterns using a simulated dataset. We apply several statistical methods to understand and handle missing-data scenarios, and we compare the results using a set of quantitative metrics.

Presentation Outline

5 Minutes

Abstract and study objective. Description of project structure.

Part 1: Synthetic Data Study

10 Minutes

Missing value generation mechanisms. Exploration via plots of different mechanisms.
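
As an illustration of a MAR mechanism, the sketch below removes values of one column with a probability that depends only on another, fully observed column. The function name `make_mar` and its parameters (`strength`, `base_rate`) are hypothetical and are not taken from `missing_data.R`; the logistic missingness model is one common choice, assumed here for the example.

```r
# Hypothetical helper (not from missing_data.R): make `target` missing with a
# probability that depends only on the fully observed column `driver`.
make_mar <- function(data, target, driver, strength = 1.5, base_rate = 0.3) {
  z <- as.numeric(scale(data[[driver]]))               # standardise the driver
  p_miss <- plogis(qlogis(base_rate) + strength * z)   # logistic missingness model
  is_missing <- runif(nrow(data)) < p_miss             # draw the missingness mask
  data[[target]][is_missing] <- NA
  data
}

set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 * x + rnorm(n)
full <- data.frame(x = x, y = y)

mar <- make_mar(full, target = "y", driver = "x")

# Under this mechanism, rows with a missing y tend to have a larger x.
mean_x_missing <- mean(mar$x[is.na(mar$y)])
mean_x_observed <- mean(mar$x[!is.na(mar$y)])
```

Comparing `mean_x_missing` with `mean_x_observed` is a quick check that the missingness really depends on the observed variable, which is what separates MAR from data missing completely at random.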

Imputation strategies:

Added noise to the imputation
Linear with heteroscedasticity
Polynomial imbalanced
Non-polynomial (Piecewise)
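
One way to read "added noise to the imputation" is stochastic regression imputation: predict the missing values from a fitted model, then add residual-scale noise so the imputed values do not all lie on the regression line. The sketch below assumes a simple linear model; `impute_regression_noise` is a hypothetical name and may differ from what `imputation_methods.R` implements.

```r
# Hypothetical helper (not from imputation_methods.R): linear regression
# imputation with Gaussian noise drawn at the residual standard deviation.
impute_regression_noise <- function(data, target, predictor) {
  miss <- is.na(data[[target]])
  fit <- lm(reformulate(predictor, response = target), data = data[!miss, ])
  pred <- predict(fit, newdata = data[miss, , drop = FALSE])
  noise <- rnorm(sum(miss), mean = 0, sd = sigma(fit))  # residual-scale noise
  data[[target]][miss] <- pred + noise
  data
}

set.seed(7)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)
dat$y[sample(n, 60)] <- NA          # remove 60 values for the demonstration

completed <- impute_regression_noise(dat, target = "y", predictor = "x")
```

Without the noise term, the imputed points would understate the variance of `y`, which in turn distorts any distribution-level comparison between original and imputed data.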

Explanation of the metrics used to compare distributions, and why we chose them. To measure the divergence between the original and imputed datasets, we use two metrics: the Wasserstein distance, which quantifies how far apart the two distributions are, and the square root of the Jensen-Shannon divergence (the Jensen-Shannon distance), which measures how similar the two probability distributions are.
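
A minimal sketch of both metrics for a single numeric variable follows. The function names `wasserstein_1d` and `js_distance` are hypothetical and not necessarily those in `metrics.R`. The Wasserstein distance is computed as the integral of the absolute difference between the two empirical CDFs; the Jensen-Shannon distance is computed on a shared histogram, and the choice of 30 bins is an assumption of this example.

```r
# 1-Wasserstein distance between two samples: integral of |F_a - F_b|.
wasserstein_1d <- function(a, b) {
  grid <- sort(unique(c(a, b)))
  gap <- abs(ecdf(a)(grid) - ecdf(b)(grid))
  sum(gap[-length(grid)] * diff(grid))
}

# Square root of the Jensen-Shannon divergence on a shared histogram.
# With log base 2 the result lies in [0, 1].
js_distance <- function(a, b, bins = 30) {
  breaks <- seq(min(a, b), max(a, b), length.out = bins + 1)
  to_prob <- function(v) {
    idx <- findInterval(v, breaks, rightmost.closed = TRUE, all.inside = TRUE)
    tabulate(idx, nbins = bins) / length(v)
  }
  p <- to_prob(a)
  q <- to_prob(b)
  m <- (p + q) / 2
  kl <- function(u, v) sum(ifelse(u > 0, u * log2(u / v), 0))
  sqrt((kl(p, m) + kl(q, m)) / 2)
}

set.seed(1)
original <- rnorm(500)
imputed_close <- rnorm(500)            # same distribution as the original
imputed_shifted <- rnorm(500, mean = 2)  # clearly different distribution

w_close <- wasserstein_1d(original, imputed_close)
w_shifted <- wasserstein_1d(original, imputed_shifted)
js_close <- js_distance(original, imputed_close)
js_shifted <- js_distance(original, imputed_shifted)
```

The two metrics complement each other: the Wasserstein distance is expressed in the units of the variable and grows with the size of the shift, while the Jensen-Shannon distance is bounded and reflects how much the two histograms overlap.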

Visualization and comparison of imputation strategies

Part 2: Case Study

10 Minutes

This part is a case study on a real dataset with missing data. We apply the techniques studied in Part 1, taking into account what we learned there.

Dataset description
- Highlight missing value mechanisms
- Exploratory data analysis
- Train-test split
Dataset imputation
Model fitting
Results comparison
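
The steps above can be sketched end to end as follows. This example uses R's built-in `airquality` dataset purely for illustration; it is not the dataset used in the case study. Mean imputation and a linear model are stand-ins for whichever imputation strategies and models the notebooks actually compare. The imputation value is learned on the training set only, so that no information from the test set leaks into the fit.

```r
set.seed(123)
data(airquality)   # built-in dataset; Ozone and Solar.R contain missing values

# Keep rows where the response is observed (an assumption of this sketch).
df <- airquality[!is.na(airquality$Ozone), ]

# Train-test split (80/20).
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Imputation: estimate the fill value on the training set, apply it to both.
solar_mean <- mean(train$Solar.R, na.rm = TRUE)
train$Solar.R[is.na(train$Solar.R)] <- solar_mean
test$Solar.R[is.na(test$Solar.R)] <- solar_mean

# Model fitting and evaluation on the held-out set.
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$Ozone - pred)^2))
```

Repeating the last two blocks with a different imputation strategy and comparing the resulting `rmse` values gives a simple basis for the results comparison.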

Project Structure

.
├── notebooks
│   ├── partial_analyses
│   │   ├── part_1.Rmd
│   │   └── part_2.Rmd
│   └── final_results.rmd
├── src
│   ├── imputation_methods.R
│   ├── metrics.R
│   ├── missing_data.R
│   ├── plots.R
│   ├── setup.R
│   ├── synthetic_data.R
│   └── utils.R
└── README.md

README.md: Project documentation
final_results.html: Project html file containing everything we will present
part_1.Rmd: Analyzing missing data patterns + imputation on synthetic data
part_2.Rmd: This will contain a case study on a real dataset
final_results.Rmd: Comprehensive results and conclusions (the file which puts everything together)

Run knit on this file to obtain the final report

Utilities

synthetic_data.R: Functions to generate a synthetic dataset
imputation_methods.R: Functions to implement different imputation techniques
missing_data.R: Functions to artificially generate missing data
metrics.R: Functions to evaluate different strategies to handle missing data
utils.R: Functions that are general utilities
plots.R: Functions to make plots
setup.R: All libraries + setting seed (imported for each notebook)

Project Map

To add new version ...

Resources from literature

Everything that might be useful even in the future

Contributors

Jacopo Zacchigna, Devid Rosa, Ludovica Bianchi, Cristiano Baldassi

This project is part of the Statistical Methods Examination.
