A reproducible Python framework for simulating realistic missingness patterns in time-series data for imputation benchmarking and evaluation.
This library implements statistically grounded missingness mechanisms—MCAR, MAR, and MNAR—with precise or calibrated missing-rate control, support for multivariate and multi-subject time series, optional block (contiguous) dropout, and full reproducibility via seeded random generators.
It is designed for research-grade evaluation of imputation models, especially in healthcare and sensor data settings where missingness is structured, correlated, and non-random.
Most imputation benchmarks rely on simplistic random masking that does not reflect real-world data collection processes. In practice:
- Sensors fail during activity
- Devices drop out for contiguous time windows
- Extreme values are more likely to be missing
- Missingness is correlated with other variables
This library provides a unified, configurable, and reproducible framework to simulate these patterns while preserving a known ground truth for fair model comparison.
- Three standard missingness mechanisms: MCAR, MAR, MNAR
- Optional block (contiguous) missingness to simulate sensor dropout
- Supports both:
  - 2D arrays: (T, D) (time × features)
  - 3D arrays: (N, T, D) (subjects × time × features)
- Exact missing-rate control for MCAR
- Calibrated missing-rate control for MAR/MNAR via binary search
- Fully reproducible using NumPy’s Generator API
- Respects existing NaNs in the data
- Returns both:
  - X_missing (with NaNs inserted)
  - mask (True = observed, False = missing)

In the returned mask, mask == True marks an observed value and mask == False marks a missing value.
This allows direct evaluation using:
missing_idx = ~mask
error = np.mean((X[missing_idx] - X_imputed[missing_idx])**2)

pip install -e .

import numpy as np
from ts_missingness import simulate_missingness
X = np.random.randn(1000, 6)
X_miss, mask = simulate_missingness(
X, mechanism="mar", missing_rate=0.2, seed=42, driver_dims=[0]
)
print("Actual missing rate:", (~mask).mean())

Definition
MCAR (Missing Completely At Random): missingness is independent of both observed and unobserved data.
Mathematical model
P(M_ij = 1) = ρ
Implementation
- Uniform random sampling without replacement
- Exactly ⌊n × ρ⌋ positions are masked among eligible entries
- Guarantees precise missing-rate control
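
The sampling step above can be sketched with NumPy alone. This is a minimal illustration, not the library's actual code: the function name `mcar_mask_sketch` is hypothetical, and the handling of pre-existing NaNs (excluded from the eligible pool and reported as missing in the mask) is an assumption.

```python
import numpy as np

def mcar_mask_sketch(X, missing_rate, seed=None):
    """Illustrative exact-rate MCAR masking (hypothetical helper).

    Returns (X_missing, mask) with mask True = observed, False = missing.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)
    # Eligible entries: flat indices of values that are not already NaN.
    eligible = np.flatnonzero(observed)
    # Exactly floor(n * rho) eligible positions are masked.
    n_missing = int(np.floor(eligible.size * missing_rate))
    drop = rng.choice(eligible, size=n_missing, replace=False)

    mask = observed.copy()  # assumption: existing NaNs count as missing
    mask.flat[drop] = False
    X_missing = X.astype(float, copy=True)
    X_missing[~mask] = np.nan
    return X_missing, mask

X = np.random.default_rng(0).standard_normal((200, 4))
X_miss, mask = mcar_mask_sketch(X, missing_rate=0.25, seed=42)
print("Masked entries:", (~mask).sum())
```

Because the positions are drawn without replacement from a fixed pool, the number of masked entries is deterministic; only their locations depend on the seed.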
Use cases
- Random packet loss
- Uncorrelated sensor glitches
- Transmission errors
Definition
MAR (Missing At Random): missingness depends on observed variables, but not on the missing value itself.
Model
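P(M_ij = 1 | X) = σ(α · z_i + β)
z_i = (driver_i - μ) / σ
σ(x) = 1 / (1 + exp(-x))
Here σ(·) with an argument denotes the logistic sigmoid, whereas μ and σ in the definition of z_i are the mean and standard deviation of the driver signal. α sets how strongly missingness depends on the driver, and β is the offset calibrated to the target missing rate.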
Procedure
- Compute a driver signal from specified dimensions
- Normalize the driver
- Convert to probabilities using a sigmoid
- Calibrate offset β via binary search to match target missing rate ρ
- Sample Bernoulli(p_ij) at eligible positions
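
The procedure above can be sketched as follows for a 2D array. This is an illustration under stated assumptions rather than the library's implementation: the names `calibrate_beta` and `mar_mask_sketch` are hypothetical, the driver is assumed to be the mean of the driver dimensions, the input is assumed to contain no NaNs, and every column (including the driver columns) is treated as eligible.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def calibrate_beta(score, alpha, target_rate, lo=-30.0, hi=30.0, n_iter=60):
    """Bisect on beta until the mean probability matches target_rate."""
    for _ in range(n_iter):
        beta = 0.5 * (lo + hi)
        if sigmoid(alpha * score + beta).mean() < target_rate:
            lo = beta  # too few missing on average: raise the offset
        else:
            hi = beta  # too many missing on average: lower the offset
    return 0.5 * (lo + hi)

def mar_mask_sketch(X, missing_rate, driver_dims, alpha=2.0, seed=None):
    """Illustrative MAR masking for X of shape (T, D); True = observed."""
    rng = np.random.default_rng(seed)
    # Assumption: the driver signal is the mean over the driver dimensions.
    driver = X[:, driver_dims].mean(axis=1)
    z = (driver - driver.mean()) / driver.std()
    beta = calibrate_beta(z, alpha, missing_rate)
    p = sigmoid(alpha * z + beta)[:, None]  # one probability per time step
    missing = rng.random(X.shape) < p       # Bernoulli(p_i) for each entry
    mask = ~missing
    return np.where(mask, X, np.nan), mask

X = np.random.default_rng(0).standard_normal((1000, 6))
X_miss, mask = mar_mask_sketch(X, missing_rate=0.25, driver_dims=[0], seed=42)
print("Realized missing rate:", (~mask).mean())
```

The mean probability is monotonically increasing in β, which is what makes bisection applicable. Calibration fixes the expected rate only, so the realized rate fluctuates around the target from draw to draw.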
Use cases
- Sensor failure during high activity
- Dropout correlated with physiological state
- Context-dependent data loss
Definition
MNAR (Missing Not At Random): missingness depends on the value itself, which is unobserved once it is missing. This is the most challenging and least identifiable setting.
Model
P(M_ij = 1 | X_ij) = σ(α · f(z_ij) + β)
z_ij = (X_ij - μ_j) / σ_j
f(z) = z     (mode="high")
f(z) = -z    (mode="low")
f(z) = |z|   (mode="extreme")
Procedure
- Normalize each dimension independently
- Compute score based on mode
- Apply sigmoid to obtain probabilities
- Calibrate offset β to achieve target missing rate
- Sample Bernoulli(p_ij)
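
A compact sketch of these steps for a 2D array, using only NumPy. The names `mnar_score` and `mnar_mask_sketch` are hypothetical, the input is assumed to contain no NaNs, and the bisection bounds are arbitrary choices for illustration; the library's own code may differ.

```python
import numpy as np

def mnar_score(X, mode="extreme"):
    """Per-dimension z-score mapped through f(z) for the chosen mode."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    if mode == "high":
        return z
    if mode == "low":
        return -z
    if mode == "extreme":
        return np.abs(z)
    raise ValueError(f"unknown mode: {mode!r}")

def mnar_mask_sketch(X, missing_rate, mode="extreme", alpha=2.0, seed=None):
    """Illustrative MNAR masking for X of shape (T, D); True = observed."""
    rng = np.random.default_rng(seed)
    score = mnar_score(X, mode)

    def probs(beta):
        return 1.0 / (1.0 + np.exp(-(alpha * score + beta)))

    # Bisect on beta so the mean probability matches the target rate.
    lo, hi = -30.0, 30.0
    for _ in range(60):
        beta = 0.5 * (lo + hi)
        if probs(beta).mean() < missing_rate:
            lo = beta
        else:
            hi = beta
    p = probs(0.5 * (lo + hi))

    missing = rng.random(X.shape) < p  # Bernoulli(p_ij) per entry
    mask = ~missing
    return np.where(mask, X, np.nan), mask

X = np.random.default_rng(0).standard_normal((1000, 6))
X_miss, mask = mnar_mask_sketch(X, missing_rate=0.10, mode="extreme", seed=42)
print("Realized missing rate:", (~mask).mean())
```

With mode="extreme", entries far from their column mean in either direction receive the highest missingness probability, so the masked values are on average more extreme than the observed ones.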
Use cases
- Sensor saturation at extremes
- Ceiling/floor effects
- Detection limits
Purpose
Block missingness simulates contiguous dropout periods, which are common in real sensor data.
Behavior
- Applied as a post-processing step on top of MCAR/MAR/MNAR
- Preserves global missing rate
- Increases temporal correlation of missingness
Parameters
- block=True
- block_len: length of each missing segment
- block_density: fraction of missingness placed into blocks
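
One way such a post-processing step could preserve the missing count is sketched below. This is not the library's algorithm: the function `blockify_sketch` is hypothetical, it handles only 2D boolean masks, and it assumes blocks are aligned to a grid of width block_len so that they never overlap.

```python
import numpy as np

def blockify_sketch(mask, block_len, block_density, seed=None):
    """Move part of each column's missingness into contiguous blocks.

    mask: boolean array of shape (T, D), True = observed.
    The number of missing entries per column is left unchanged.
    """
    rng = np.random.default_rng(seed)
    T, D = mask.shape
    out = np.ones_like(mask, dtype=bool)
    for j in range(D):
        missing_idx = np.flatnonzero(~mask[:, j])
        m = missing_idx.size
        # Number of whole blocks that fit into the block share of the budget.
        n_blocks = int(block_density * m) // block_len
        # Grid-aligned slots guarantee non-overlapping blocks (assumption).
        slots = rng.choice(T // block_len, size=n_blocks, replace=False)
        in_block = np.zeros(T, dtype=bool)
        for s in slots:
            in_block[s * block_len:(s + 1) * block_len] = True
        # Spend the remaining budget on original missing points outside blocks.
        leftover = missing_idx[~in_block[missing_idx]]
        n_points = m - n_blocks * block_len
        keep = rng.choice(leftover, size=n_points, replace=False)
        col_missing = in_block.copy()
        col_missing[keep] = True
        out[:, j] = ~col_missing
    return out

rng = np.random.default_rng(0)
mask = rng.random((1000, 3)) >= 0.2  # point-wise mask, about 20% missing
blocked = blockify_sketch(mask, block_len=20, block_density=0.7, seed=42)
print("Missing per column before:", (~mask).sum(axis=0))
print("Missing per column after: ", (~blocked).sum(axis=0))
```

The scattered points that remain are drawn from the original mask, so the part of the missingness that is not moved into blocks still reflects the underlying mechanism.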
Use cases
- Battery depletion
- Device removal
- Connectivity loss
# MCAR: 15% random missing
X_miss, mask = simulate_missingness(X, "mcar", missing_rate=0.15, seed=42)
# MAR: 25% missing driven by dimension 0
X_miss, mask = simulate_missingness(
X, "mar", missing_rate=0.25, seed=42, driver_dims=[0], strength=2.0
)
# MNAR: 10% extreme values missing
X_miss, mask = simulate_missingness(
X, "mnar", missing_rate=0.10, seed=42, mnar_mode="extreme", strength=2.0
)
# Block missingness
X_miss, mask = simulate_missingness(
X, "mcar", missing_rate=0.20, seed=42, block=True, block_len=60, block_density=0.7
)

X_miss, mask = simulate_missingness(X, "mcar", 0.20, seed=42)
X_imputed = your_imputation_method(X_miss)
missing_idx = ~mask
rmse = np.sqrt(np.mean((X[missing_idx] - X_imputed[missing_idx])**2))
mae = np.mean(np.abs(X[missing_idx] - X_imputed[missing_idx]))
print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")

Parameters
- X: array of shape (T, D) or (N, T, D)
- mechanism: "mcar", "mar", or "mnar"
- missing_rate: float in [0, 1]
- seed: optional int
- **kwargs: mechanism-specific options
Returns
- X_missing: array with NaNs inserted
- mask: boolean array (True = observed, False = missing)
- Uses NumPy’s Generator API
- No reliance on global RNG state
- Same seed → identical masks
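
The following short sketch illustrates the general pattern behind these guarantees using plain NumPy. The helper `draw_mask` is hypothetical and stands in for any seeded masking routine; it does not call the library itself.

```python
import numpy as np

def draw_mask(seed, shape=(100, 4), missing_rate=0.2):
    """Hypothetical seeded mask draw; True = observed."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    return rng.random(shape) >= missing_rate

np.random.seed(0)
a = draw_mask(42)
np.random.seed(123)  # changing the global RNG state has no effect
b = draw_mask(42)
c = draw_mask(7)     # a different seed gives a different mask

print("Same seed, identical masks:", np.array_equal(a, b))
```

Creating the generator inside the function from an explicit seed keeps results independent of any other code that touches `np.random`.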
pytest ts_missingness/tests/

@software{ts_missingness,
author = {Feruz Oripov},
title = {Time-Series Missingness Simulation Library},
year = {2026},
url = {https://github.com/feruzoripov/ts_missingness}
}

MIT