Monday, April 13, 2026

Handling Missing Data - Part 1

One of the big topics I have been speaking about is “Data Cleansing for Machine Learning”.  A major part of the data cleansing process is filling in missing values.  Values can be missing for a variety of reasons (manually entered data, improper data collection, incorrect exception handling, etc.).  Missing data falls into three categories:

  • MAR (Missing at Random): the probability of missingness depends on other observed variables (e.g., age or education), such as certain groups of respondents being less likely to answer a question.
  • MCAR (Missing Completely at Random): missingness is unrelated to any variable, observed or not (e.g., random glitches, accidental data loss, sensor failure).
  • MNAR (Missing Not at Random): data are missing because of the value of the missing data itself.
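The difference between MCAR and MAR is easiest to see with synthetic data. The sketch below (illustrative only; the column names and probabilities are made up for the example) masks the same variable two ways: once uniformly at random (MCAR) and once with a rate that depends on an observed variable, age (MAR).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50_000, 15_000, n),
})

# MCAR: every income value has the same 10% chance of being missing,
# regardless of anything else in the data.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "income"] = np.nan

# MAR: missingness in income depends on an *observed* variable (age);
# here, younger respondents skip the income question more often.
mar = df.copy()
p_miss = np.where(mar["age"] < 30, 0.30, 0.05)
mar.loc[rng.random(n) < p_miss, "income"] = np.nan

print(mcar["income"].isna().mean())                       # roughly 0.10
print(mar.loc[mar["age"] < 30, "income"].isna().mean())   # roughly 0.30
```

In the MAR frame, the missingness rate differs sharply by age group, which is exactly the pattern MICE is designed to exploit.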


To address missing data, several imputation techniques are available.  In my video, I discuss three of them:

  1. Multiple Imputation by Chained Equations (MICE) handles numeric and factor (categorical) data; text variables must be converted to factors or excluded from the imputation process.
  2. Principal Component Analysis (PCA) works only on numerical data, although adaptations of PCA exist for categorical or mixed datasets.
  3. Probabilistic Principal Component Analysis (PPCA) is designed for, and works only with, numeric (quantitative) data.
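As a quick sketch of the chained-equations idea, scikit-learn's IterativeImputer (which its documentation describes as inspired by MICE) models each feature with missing values as a function of the other features and cycles those regressions until the imputations stabilize. The data below is synthetic; by default this produces a single imputation rather than the multiple imputations of full MICE.

```python
import numpy as np
# IterativeImputer is experimental; this import enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
# Make the third column predictable from the first two.
X[:, 2] = 0.6 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * rng.normal(size=200)

# Knock out 15% of entries completely at random.
mask = rng.random(X.shape) < 0.15
X_missing = X.copy()
X_missing[mask] = np.nan

# Each feature with missing values is regressed on the others,
# round-robin, until the filled-in values stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)
```

For a closer analogue of true multiple imputation, `sample_posterior=True` with several random seeds yields multiple completed datasets whose results can be pooled.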


MICE

  Pros:
    • Flexible; can handle many variable types and complex data structures.
    • Designed for MAR scenarios.
  Cons:
    • Lower accuracy than PPCA for MCAR data.
    • Requires careful predictor selection; can be difficult in high‑variable datasets.
  Best Use Cases:
    • Surveys, questionnaires, and social‑science datasets with many variable types.
    • Situations where missingness depends on observed variables (MAR).

PCA

  Pros:
    • Captures major variance structure and reduces dimensionality.
  Cons:
    • Does not model uncertainty in missing values.
    • Less effective when data do not follow a linear low‑rank structure.
  Best Use Cases:
    • Data with strong linear correlations.
    • Situations where dimensionality reduction is also desired.

PPCA

  Pros:
    • Higher imputation accuracy than MICE for MCAR data.
    • Probabilistic model better captures latent low‑rank structure. (Inference based on PPCA formulation.)
  Cons:
    • Advantage strongest under MCAR; may not outperform MICE under MAR or MNAR. (Inference based on MICE’s MAR suitability.)
    • Assumes a linear Gaussian latent structure.
  Best Use Cases:
    • Large datasets with continuous variables and low‑rank structure.
    • MCAR missingness scenarios.
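The PCA entries above can be made concrete with a common iterative scheme: start from column means, fit a low-rank PCA reconstruction, overwrite only the missing cells, and repeat. This is a minimal sketch, not a production implementation; the function name `pca_impute` and the synthetic two-column data are my own for illustration.

```python
import numpy as np

def pca_impute(X, n_components=1, n_iter=50):
    """Fill NaNs by alternating between a rank-k PCA fit and
    re-imputing the missing cells from the reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means, then refine.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Rank-k reconstruction captures the dominant linear structure.
        approx = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        filled[missing] = approx[missing]  # overwrite only the missing cells
    return filled

# Demo: two perfectly collinear columns; hide one entry and recover it.
rng = np.random.default_rng(1)
z = rng.normal(size=100)
X = np.column_stack([z, 2 * z])
X_missing = X.copy()
X_missing[0, 1] = np.nan
X_hat = pca_impute(X_missing, n_components=1)
```

Because the demo data are exactly rank one, the hidden entry is recovered almost exactly; on real data the accuracy depends on how well a linear low-rank structure actually fits, which is the PCA limitation noted in the table.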

Summary

Choose MICE if:

  • Your dataset includes categorical, ordinal, or mixed-type variables.
  • Missingness is likely MAR.
  • You need flexible, model-based imputations.


Choose PCA if:

  • Your data are continuous and have a strong linear structure.
  • You want a simple, fast imputation method.
  • Dimensionality reduction is also a goal.


Choose PPCA if:

  • Your missingness is MCAR.
  • Your data follow a low-rank Gaussian structure.
  • You want higher accuracy than MICE in MCAR settings.
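To show what "low-rank Gaussian structure" means in PPCA terms, here is a sketch of the closed-form maximum-likelihood fit from Tipping and Bishop's formulation, applied to complete synthetic data (handling missing entries requires an EM variant and is omitted here; the function name `fit_ppca` and the data are my own for illustration).

```python
import numpy as np

def fit_ppca(X, k):
    """Closed-form ML fit of PPCA: X ~ N(mu, W W^T + sigma^2 I)
    with a k-dimensional latent space (Tipping & Bishop)."""
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False, bias=True)   # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # ML noise variance = average of the discarded eigenvalues.
    sigma2 = eigvals[k:].mean()
    # ML loadings: top-k eigenvectors scaled by sqrt(lambda_i - sigma^2).
    W = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k] - sigma2, 0.0))
    return mu, W, sigma2

# Demo: sample from a true 2-dimensional latent model with noise sd 0.1,
# then check that the fit recovers the noise variance (~0.01).
rng = np.random.default_rng(7)
n, d, k = 5000, 5, 2
W_true = rng.normal(size=(d, k))
Z = rng.normal(size=(n, k))
X = Z @ W_true.T + 0.1 * rng.normal(size=(n, d))
mu, W, sigma2 = fit_ppca(X, k)
```

The explicit noise term sigma^2 is what gives PPCA its probabilistic handle on uncertainty, in contrast to plain PCA.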

