Advanced Missing Data Imputation
Mean imputation is a lie. It preserves the mean but destroys variance, correlations, and every other statistical property of your data. If 30% of a column is missing and you fill it all with the mean, you have created a spike in the distribution that never existed. Advanced imputation methods use the structure and relationships in your data to generate plausible values, preserving statistical properties and even quantifying uncertainty. This page covers every method worth knowing.
Missing Data Mechanisms
Before choosing an imputation method, you must understand why data is missing:
| Mechanism | Example | Safe to Impute? |
|---|---|---|
| MCAR | Random sensor failures | Yes — any method |
| MAR | High-income people skip income question, but you know their education level | Yes — use related variables |
| MNAR | Depressed patients skip depression score questions | Dangerous — imputed values are biased |
# missing_analysis.py — Analyze missing data patterns
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use("Agg")
def analyze_missing(df: pd.DataFrame) -> dict:
"""Comprehensive missing data analysis."""
n_rows = len(df)
column_stats = []
for col in df.columns:
n_missing = df[col].isnull().sum()
pct_missing = n_missing / n_rows * 100
column_stats.append({
"column": col,
"n_missing": n_missing,
"pct_missing": pct_missing,
"dtype": str(df[col].dtype),
})
stats_df = pd.DataFrame(column_stats).sort_values(
"pct_missing", ascending=False
)
# Missing data pattern analysis
missing_pattern = df.isnull().astype(int)
pattern_counts = (
missing_pattern.groupby(list(missing_pattern.columns))
.size()
.reset_index(name="count")
.sort_values("count", ascending=False)
)
# Test MCAR using Little's test approximation
# Correlation between missingness indicators and observed values
correlations = {}
for col in df.columns:
if df[col].isnull().sum() == 0:
continue
missing_indicator = df[col].isnull().astype(int)
for other_col in df.select_dtypes(include=[np.number]).columns:
if other_col == col:
continue
corr = missing_indicator.corr(df[other_col])
if abs(corr) > 0.1:
correlations[f"{col}_missing ~ {other_col}"] = corr
return {
"column_stats": stats_df,
"n_complete_rows": (df.isnull().sum(axis=1) == 0).sum(),
"pct_complete_rows": (df.isnull().sum(axis=1) == 0).mean() * 100,
"n_patterns": len(pattern_counts),
"top_patterns": pattern_counts.head(10),
"mar_correlations": correlations,
}KNN Imputation
# knn_imputation.py — K-Nearest Neighbors imputation
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
import logging
logger = logging.getLogger(__name__)
class KNNImputerWrapper:
"""
KNN Imputation: fill missing values using the mean of K nearest neighbors.
How it works:
1. For each missing value, find K complete records most similar to this row.
2. Use the mean (or weighted mean) of their values for the missing feature.
3. Similarity is measured by Euclidean distance on non-missing features.
Pros: Captures local patterns, no distributional assumptions
Cons: Slow for large datasets (O(n^2)), sensitive to scale, struggles with high-dimensional data
"""
def __init__(
self,
n_neighbors: int = 5,
weights: str = "distance", # "uniform" or "distance"
):
self.n_neighbors = n_neighbors
self.weights = weights
self.scaler = StandardScaler()
self.imputer = KNNImputer(
n_neighbors=n_neighbors,
weights=weights,
)
def fit_transform(
self,
df: pd.DataFrame,
numeric_columns: list[str] | None = None,
) -> pd.DataFrame:
"""Impute missing values using KNN."""
result = df.copy()
cols = numeric_columns or df.select_dtypes(include=[np.number]).columns.tolist()
if not cols:
return result
# Scale before KNN (distance-based method)
scaled = self.scaler.fit_transform(result[cols])
imputed_scaled = self.imputer.fit_transform(scaled)
imputed = self.scaler.inverse_transform(imputed_scaled)
result[cols] = imputed
n_imputed = df[cols].isnull().sum().sum()
logger.info(f"KNN imputed {n_imputed} values across {len(cols)} columns")
return result
# Usage
knn = KNNImputerWrapper(n_neighbors=5, weights="distance")
df_imputed = knn.fit_transform(df, numeric_columns=["age", "income", "score"])MICE (Multiple Imputation by Chained Equations)
# mice_imputation.py — MICE iterative imputation
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.ensemble import RandomForestRegressor
import logging
logger = logging.getLogger(__name__)
class MICEImputer:
"""
MICE: Multiple Imputation by Chained Equations.
How it works:
1. Initialize missing values (e.g., with mean).
2. For each column with missing values:
a. Treat it as the target variable.
b. Use all other columns as predictors.
c. Fit a regression model on observed values.
d. Predict missing values.
3. Repeat steps 2a-2d for multiple iterations until convergence.
4. Optionally, create multiple imputed datasets for uncertainty.
This is the gold standard for MAR data.
"""
def __init__(
self,
estimator: str = "bayesian_ridge",
max_iter: int = 10,
n_imputations: int = 5,
random_state: int = 42,
):
self.estimator_name = estimator
self.max_iter = max_iter
self.n_imputations = n_imputations
self.random_state = random_state
def _get_estimator(self):
estimators = {
"bayesian_ridge": BayesianRidge(),
"ridge": Ridge(alpha=1.0),
"random_forest": RandomForestRegressor(
n_estimators=50, max_depth=10, random_state=self.random_state
),
}
return estimators.get(self.estimator_name, BayesianRidge())
def single_imputation(
self,
df: pd.DataFrame,
numeric_columns: list[str] | None = None,
) -> pd.DataFrame:
"""Perform single MICE imputation."""
result = df.copy()
cols = numeric_columns or df.select_dtypes(include=[np.number]).columns.tolist()
imputer = IterativeImputer(
estimator=self._get_estimator(),
max_iter=self.max_iter,
random_state=self.random_state,
sample_posterior=False,
)
result[cols] = imputer.fit_transform(result[cols])
n_imputed = df[cols].isnull().sum().sum()
logger.info(f"MICE imputed {n_imputed} values ({self.estimator_name})")
return result
def multiple_imputation(
self,
df: pd.DataFrame,
numeric_columns: list[str] | None = None,
) -> list[pd.DataFrame]:
"""
Create multiple imputed datasets for uncertainty quantification.
Each dataset has slightly different imputed values, reflecting
uncertainty about the true values. Analyze each separately,
then combine results using Rubin's rules.
"""
cols = numeric_columns or df.select_dtypes(include=[np.number]).columns.tolist()
imputed_datasets = []
for i in range(self.n_imputations):
imputer = IterativeImputer(
estimator=self._get_estimator(),
max_iter=self.max_iter,
random_state=self.random_state + i,
sample_posterior=True, # Add randomness for multiple imputation
)
result = df.copy()
result[cols] = imputer.fit_transform(result[cols])
imputed_datasets.append(result)
logger.info(f"MICE imputation {i + 1}/{self.n_imputations} complete")
return imputed_datasets
def rubins_rules(estimates: list[float], variances: list[float]) -> dict:
"""
Combine estimates from multiple imputed datasets using Rubin's rules.
Returns pooled estimate and confidence interval.
"""
m = len(estimates)
mean_estimate = np.mean(estimates)
# Within-imputation variance
within_var = np.mean(variances)
# Between-imputation variance
between_var = np.var(estimates, ddof=1)
# Total variance
total_var = within_var + (1 + 1 / m) * between_var
# Confidence interval
se = np.sqrt(total_var)
return {
"pooled_estimate": mean_estimate,
"total_variance": total_var,
"within_variance": within_var,
"between_variance": between_var,
"standard_error": se,
"ci_lower": mean_estimate - 1.96 * se,
"ci_upper": mean_estimate + 1.96 * se,
}
# Usage
mice = MICEImputer(estimator="bayesian_ridge", n_imputations=5)
df_single = mice.single_imputation(df)
# Multiple imputation with uncertainty
imputed_datasets = mice.multiple_imputation(df)
# Example: pool regression coefficients
estimates = []
variances = []
for imp_df in imputed_datasets:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X = imp_df[["feature1", "feature2"]].values
y = imp_df["target"].values
model.fit(X, y)
estimates.append(model.coef_[0])
# Use bootstrap or analytic formula for variance
variances.append(0.01) # Simplified
pooled = rubins_rules(estimates, variances)
print(f"Pooled coefficient: {pooled['pooled_estimate']:.4f} "
f"[{pooled['ci_lower']:.4f}, {pooled['ci_upper']:.4f}]")MissForest
# missforest.py — Random Forest-based imputation
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import logging
logger = logging.getLogger(__name__)
class MissForest:
"""
MissForest: Non-parametric imputation using Random Forests.
Advantages over MICE:
- Handles both numeric and categorical variables natively
- Captures non-linear relationships
- No distributional assumptions
- Generally more accurate than MICE on mixed-type data
Disadvantage: slower than MICE for large datasets
"""
def __init__(
self,
max_iter: int = 10,
n_estimators: int = 100,
random_state: int = 42,
convergence_threshold: float = 1e-4,
):
self.max_iter = max_iter
self.n_estimators = n_estimators
self.random_state = random_state
self.convergence_threshold = convergence_threshold
def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
"""Impute missing values using MissForest algorithm."""
result = df.copy()
# Identify column types
numeric_cols = result.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = result.select_dtypes(include=["object", "category"]).columns.tolist()
# Encode categoricals
encoders = {}
for col in categorical_cols:
le = LabelEncoder()
non_null = result[col].dropna()
le.fit(non_null)
result.loc[non_null.index, col] = le.transform(non_null)
encoders[col] = le
# Sort columns by number of missing values (ascending)
missing_counts = result.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].sort_values().index.tolist()
if not cols_with_missing:
return df # Nothing to impute
# Initialize with column medians/modes
for col in cols_with_missing:
if col in numeric_cols:
result[col] = result[col].fillna(result[col].median())
else:
mode = result[col].mode()
fill_val = mode[0] if len(mode) > 0 else 0
result[col] = result[col].fillna(fill_val)
# Iterative imputation
all_cols = numeric_cols + categorical_cols
previous_values = result[cols_with_missing].values.copy()
for iteration in range(self.max_iter):
for col in cols_with_missing:
# Split into observed and missing
missing_mask = df[col].isnull()
observed_mask = ~missing_mask
if missing_mask.sum() == 0:
continue
# Features = all other columns
feature_cols = [c for c in all_cols if c != col]
X_train = result.loc[observed_mask, feature_cols].astype(float)
y_train = result.loc[observed_mask, col].astype(float)
X_predict = result.loc[missing_mask, feature_cols].astype(float)
# Choose model based on column type
if col in numeric_cols:
model = RandomForestRegressor(
n_estimators=self.n_estimators,
random_state=self.random_state,
n_jobs=-1,
)
else:
model = RandomForestClassifier(
n_estimators=self.n_estimators,
random_state=self.random_state,
n_jobs=-1,
)
model.fit(X_train, y_train)
predictions = model.predict(X_predict)
result.loc[missing_mask, col] = predictions
# Check convergence
current_values = result[cols_with_missing].values
diff = np.sum((current_values - previous_values) ** 2)
norm = np.sum(current_values ** 2)
relative_change = diff / norm if norm > 0 else 0
logger.info(
f"Iteration {iteration + 1}: relative change = {relative_change:.6f}"
)
if relative_change < self.convergence_threshold:
logger.info(f"Converged at iteration {iteration + 1}")
break
previous_values = current_values.copy()
# Decode categoricals back
for col, le in encoders.items():
result[col] = le.inverse_transform(result[col].astype(int))
return result
# Usage
mf = MissForest(max_iter=10, n_estimators=100)
df_imputed = mf.fit_transform(df)Imputation Evaluation
# imputation_evaluation.py — Evaluate imputation quality
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
import logging
logger = logging.getLogger(__name__)
class ImputationEvaluator:
"""
Evaluate imputation quality by artificially introducing missing values
into complete data and comparing imputed values to the truth.
"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
self.rng = np.random.RandomState(random_state)
def create_artificial_missing(
self,
df: pd.DataFrame,
columns: list[str],
missing_rate: float = 0.2,
) -> tuple[pd.DataFrame, pd.DataFrame]:
"""
Introduce artificial missing values into complete data.
Returns: (df_with_missing, mask_of_artificial_missing)
"""
result = df.copy()
mask = pd.DataFrame(False, index=df.index, columns=columns)
for col in columns:
complete_rows = df[col].notna()
n_to_mask = int(complete_rows.sum() * missing_rate)
mask_indices = self.rng.choice(
df[complete_rows].index, n_to_mask, replace=False
)
result.loc[mask_indices, col] = np.nan
mask.loc[mask_indices, col] = True
return result, mask
def evaluate(
self,
original: pd.DataFrame,
imputed: pd.DataFrame,
mask: pd.DataFrame,
columns: list[str],
) -> pd.DataFrame:
"""Compare imputed values to original values where artificially masked."""
results = []
for col in columns:
masked = mask[col]
if not masked.any():
continue
true_values = original.loc[masked, col].astype(float)
pred_values = imputed.loc[masked, col].astype(float)
rmse = np.sqrt(mean_squared_error(true_values, pred_values))
mae = mean_absolute_error(true_values, pred_values)
orig_std = original[col].std()
nrmse = rmse / orig_std if orig_std > 0 else np.inf
# Distribution similarity
orig_mean = original[col].mean()
imp_mean = imputed[col].mean()
mean_shift = abs(imp_mean - orig_mean) / orig_std if orig_std > 0 else 0
orig_var = original[col].var()
imp_var = imputed[col].var()
var_ratio = imp_var / orig_var if orig_var > 0 else np.inf
results.append({
"column": col,
"n_imputed": int(masked.sum()),
"rmse": rmse,
"mae": mae,
"nrmse": nrmse,
"mean_shift": mean_shift,
"variance_ratio": var_ratio,
})
return pd.DataFrame(results)
def compare_methods(
self,
df: pd.DataFrame,
columns: list[str],
imputers: dict,
missing_rate: float = 0.2,
) -> pd.DataFrame:
"""Compare multiple imputation methods on the same data."""
df_missing, mask = self.create_artificial_missing(df, columns, missing_rate)
all_results = []
for method_name, imputer_fn in imputers.items():
logger.info(f"Evaluating: {method_name}")
try:
imputed = imputer_fn(df_missing)
eval_results = self.evaluate(df, imputed, mask, columns)
eval_results["method"] = method_name
all_results.append(eval_results)
except Exception as e:
logger.error(f"Method '{method_name}' failed: {e}")
if all_results:
return pd.concat(all_results, ignore_index=True)
return pd.DataFrame()
# Usage
evaluator = ImputationEvaluator()
# Define imputation methods to compare
from sklearn.impute import SimpleImputer, KNNImputer
def mean_impute(df):
result = df.copy()
cols = df.select_dtypes(include=[np.number]).columns
result[cols] = SimpleImputer(strategy="mean").fit_transform(result[cols])
return result
def median_impute(df):
result = df.copy()
cols = df.select_dtypes(include=[np.number]).columns
result[cols] = SimpleImputer(strategy="median").fit_transform(result[cols])
return result
def knn_impute(df):
result = df.copy()
cols = df.select_dtypes(include=[np.number]).columns
result[cols] = KNNImputer(n_neighbors=5).fit_transform(result[cols])
return result
def mice_impute(df):
mice = MICEImputer(estimator="bayesian_ridge")
return mice.single_imputation(df)
# Compare
comparison = evaluator.compare_methods(
df_complete,
columns=["age", "income", "score"],
imputers={
"mean": mean_impute,
"median": median_impute,
"knn_5": knn_impute,
"mice": mice_impute,
},
missing_rate=0.2,
)
print(comparison[["method", "column", "nrmse", "variance_ratio"]].to_string())Imputation Strategy Selection
# strategy_selection.py — Recommend imputation strategy
def recommend_imputation(
df: pd.DataFrame,
column: str,
target_column: str | None = None,
) -> str:
"""Recommend an imputation strategy based on data characteristics."""
series = df[column]
missing_pct = series.isnull().mean() * 100
n_rows = len(df)
n_numeric = len(df.select_dtypes(include=[np.number]).columns)
if missing_pct == 0:
return "No missing values"
if missing_pct > 70:
return (
"DROP COLUMN: >70% missing. Column is likely uninformative. "
"Consider adding a 'was_present' indicator before dropping."
)
if missing_pct > 40:
return (
"MICE or MissForest: High missing rate requires model-based "
"imputation to avoid bias. Create a missing indicator column. "
"Consider whether data is MNAR."
)
if pd.api.types.is_numeric_dtype(series):
if missing_pct < 5:
return (
"MEDIAN: Low missing rate, simple median imputation is "
"sufficient. Less sensitive to outliers than mean."
)
elif n_rows < 1000:
return (
"KNN (k=5): Small dataset, KNN captures local patterns "
"without needing many observations."
)
elif n_numeric > 5:
return (
"MICE (BayesianRidge): Multiple numeric columns available. "
"MICE leverages inter-column correlations. Use multiple "
"imputation if doing statistical inference."
)
else:
return (
"KNN or MICE: Moderate missing rate. Both work well. "
"KNN for non-linear relationships, MICE for linear."
)
else:
if missing_pct < 5:
return "MODE: Low missing rate, use most frequent category."
else:
return (
"MissForest: Categorical data with significant missingness. "
"MissForest handles mixed types natively."
)Quick Reference
| Method | Type | Handles Categorical | Speed | Accuracy | Preserves Variance |
|---|---|---|---|---|---|
| Mean/Median | Simple | No | Instant | Low | No |
| Mode | Simple | Yes | Instant | Low | No |
| KNN | Model-based | No (needs encoding) | Slow (O(n^2)) | Medium | Partially |
| MICE | Model-based | Via encoding | Medium | High | Yes |
| MissForest | Model-based | Yes (native) | Slow | Highest | Yes |
| EM Algorithm | Statistical | No | Medium | High | Yes |
| Matrix Factorization | Latent | No | Medium | Medium | Partially |
| Missing % | Recommended Approach |
|---|---|
| < 5% | Mean/median/mode (simple) |
| 5-20% | KNN or MICE |
| 20-40% | MICE with multiple imputation |
| 40-70% | MissForest + missing indicator |
| > 70% | Drop column + missing indicator |
| Evaluation Metric | Measures | Good Value |
|---|---|---|
| NRMSE | Accuracy of imputed values | < 0.3 |
| Mean shift | Bias in imputed distribution | < 0.1 |
| Variance ratio | Preservation of spread | 0.8 - 1.2 |
| Correlation preservation | Relationship maintenance | > 0.9 |
Key Takeaway
- Mean imputation preserves the mean but destroys variance, correlations, and distributional shape -- it is almost never the right choice for more than 5% missing data.
- MICE (Multiple Imputation by Chained Equations) is the gold standard for MAR data because it models inter-column relationships and can quantify imputation uncertainty.
- Before choosing an imputation method, determine why data is missing (MCAR, MAR, MNAR) -- MNAR data cannot be safely imputed without domain-specific models.
Exercise
Compare Imputation Methods
Given a complete DataFrame, artificially mask 20% of values in 3 numeric columns, then:
- Impute using mean, median, KNN (k=5), and MICE (BayesianRidge).
- For each method, compute NRMSE (normalized root mean squared error) against the true values.
- Compare the variance ratio (imputed variance / original variance) for each method.
- Determine which method best preserves the correlation structure.
Solution Sketch
import numpy as np, pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import mean_squared_error
np.random.seed(42)
df = pd.DataFrame({"a": np.random.randn(1000), "b": np.random.randn(1000)*2+5, "c": np.random.randn(1000)*0.5})
# Mask 20%
mask = np.random.random(df.shape) < 0.2
df_masked = df.copy()
df_masked[mask] = np.nan
methods = {
"mean": SimpleImputer(strategy="mean"),
"median": SimpleImputer(strategy="median"),
"knn_5": KNNImputer(n_neighbors=5),
"mice": IterativeImputer(max_iter=10, random_state=42),
}
for name, imp in methods.items():
filled = pd.DataFrame(imp.fit_transform(df_masked), columns=df.columns)
nrmse = np.sqrt(mean_squared_error(df[mask], filled[mask])) / df.std().mean()
var_ratio = filled.var().mean() / df.var().mean()
print(f"{name:8s}: NRMSE={nrmse:.4f}, VarRatio={var_ratio:.3f}")Debugging Scenario
After imputing missing values in your dataset, your linear regression model's R-squared drops from 0.85 to 0.60. The number of missing values was only 8%.
Diagnose and fix it.
Answer
Mean imputation is the most likely culprit. When 8% of values are replaced with the column mean:
- Variance shrinks: the imputed values cluster at the mean, reducing the column's spread.
- Correlations weaken: imputed values have no relationship to other columns, diluting genuine correlations that the regression relies on.
- Artificial spike: the distribution gains an artificial spike at the mean, violating the normality assumption.
Fix:
- Use MICE or KNN imputation which model inter-column relationships, preserving correlations.
- Add a missing indicator column (
column_is_missing) so the model can learn that imputed rows behave differently. - Evaluate imputation quality by comparing variance ratios and correlations before and after imputation.
- If doing statistical inference, use multiple imputation with Rubin's rules to get correct standard errors and confidence intervals.
Common Misconceptions
- "Mean imputation is simple and harmless." It destroys variance, attenuates correlations, and biases every statistical test. It is only acceptable for trivially small missing rates (< 2%).
- "KNN imputation is always better than MICE." KNN is O(n^2), struggles with high-dimensional data, and does not handle categorical variables natively. MICE scales better and handles mixed types.
- "If data is MNAR, you should still impute." MNAR means the missingness depends on the missing value itself (e.g., high-income people skip the income question). Standard imputation methods produce biased results for MNAR data.
- "One imputed dataset is sufficient." Single imputation underestimates uncertainty. Multiple imputation creates several datasets, analyzes each separately, and pools results using Rubin's rules for correct inference.
- "MissForest is always the best imputation method." MissForest is the most flexible but also the slowest. For purely numeric datasets with linear relationships, MICE with BayesianRidge is faster and equally accurate.
Quiz
1. What are the three missing data mechanisms (MCAR, MAR, MNAR)?
MCAR: missingness is completely random (sensor failure). MAR: missingness depends on observed variables (high-income people skip income question, but we know their education). MNAR: missingness depends on the missing value itself (depressed patients skip depression scores).
2. Why does mean imputation destroy variance?
All imputed values are identical (the mean), creating a spike in the distribution that shrinks the overall spread. The standard deviation decreases artificially, making the data appear more concentrated than it truly is.
3. How does MICE differ from single regression imputation?
Single regression imputes each column once using other columns as predictors. MICE iterates: it imputes column 1, then uses the imputed column 1 to improve column 2's imputation, then re-imputes column 1 using the improved column 2, repeating until convergence. This captures inter-column dependencies more accurately.
4. What is Rubin's rules, and when do you need it?
Rubin's rules combine estimates from multiple imputed datasets into a single pooled estimate with correct standard errors. You need it when doing statistical inference (hypothesis tests, confidence intervals) with multiply imputed data.
5. When should you drop a column instead of imputing it?
When more than 70% of values are missing. At that point, any imputation method is essentially fabricating the majority of the column, and the imputed values carry little genuine information. Add a binary "was_present" indicator before dropping.
One-Liner Summary: Mean imputation is a statistically destructive lie; production pipelines use MICE, KNN, or MissForest to preserve variance, correlations, and distributional shape while quantifying imputation uncertainty.