🤖 AI Summary
Problem: Existing missing-data simulation tools are fragmented, mechanism-limited (typically supporting only MCAR), and predominantly designed for numerical variables, so they fail to capture the complex, heterogeneous missingness patterns prevalent in real-world tabular data. Method: We propose the first open-source, unified framework that covers all three fundamental missingness mechanisms (MCAR, MAR, and MNAR) while natively supporting mixed-type data (numerical and categorical variables). Contribution/Results: The framework integrates four components: (1) mechanism-driven missingness simulation; (2) type-aware imputation evaluation; (3) interpretable visual diagnostics; and (4) formal MCAR statistical testing. Implemented in Python, it combines statistical inference, structured evaluation metrics, and explainable visualization to cover the full pipeline, from missing-data generation through imputation to validation. It is designed to improve rigor, reproducibility, and efficiency in missingness-mechanism research, algorithm benchmarking, and teaching on heterogeneous tabular data.
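To make the three mechanisms concrete, here is a minimal NumPy sketch of how missingness could be injected into one column under each assumption. It is an illustration only and does not use or reproduce the toolkit's API: the function name `simulate_missing`, its parameters, and the simple quantile-threshold rule for MAR and MNAR are all assumptions chosen for brevity.

```python
import numpy as np


def simulate_missing(X, mechanism="MCAR", rate=0.2, seed=0):
    """Inject NaNs into column 0 of X (hypothetical helper, not a library API).

    MCAR: missingness is independent of every value.
    MAR:  missingness in column 0 depends on the observed column 1.
    MNAR: missingness in column 0 depends on column 0's own values.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    n_rows = X.shape[0]

    if mechanism == "MCAR":
        # Each row is masked with the same probability, regardless of data.
        mask = rng.random(n_rows) < rate
    elif mechanism == "MAR":
        # Rows whose *other* (always observed) feature is largest are masked.
        driver = X[:, 1]
        mask = driver > np.quantile(driver, 1 - rate)
    elif mechanism == "MNAR":
        # Rows whose *own* value is largest are masked.
        driver = X[:, 0]
        mask = driver > np.quantile(driver, 1 - rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")

    X[mask, 0] = np.nan
    return X, mask


# Example: two continuous features, 1000 rows.
data = np.random.default_rng(42).normal(size=(1000, 2))
for mech in ("MCAR", "MAR", "MNAR"):
    _, m = simulate_missing(data, mechanism=mech, rate=0.2)
    print(mech, "missing fraction:", round(float(m.mean()), 3))
```

The deterministic threshold keeps the dependence easy to inspect. A probabilistic rule (for example, a logistic function of the driver column) is a common alternative when a softer relationship is wanted.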
📝 Abstract
Incomplete data is a persistent challenge in real-world datasets, often governed by complex and unobservable missingness mechanisms. Simulating missingness has become a standard approach for understanding its impact on learning and analysis. However, existing tools are fragmented, mechanism-limited, and typically focus only on numerical variables, overlooking the heterogeneous nature of real-world tabular data. We present MissMecha, an open-source Python toolkit for simulating, visualizing, and evaluating missing data under MCAR, MAR, and MNAR assumptions. MissMecha supports both numerical and categorical features, enabling mechanism-aware studies across mixed-type tabular datasets. It includes visual diagnostics, MCAR testing utilities, and type-aware imputation evaluation metrics. Designed to support data quality research, benchmarking, and education, MissMecha offers a unified platform for researchers and practitioners working with incomplete data.
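As a rough illustration of what "type-aware" evaluation can mean, the sketch below scores imputed values only on the cells that were masked, using a different metric per column type. The choice of RMSE for numerical columns and accuracy for categorical columns is an assumption (both are common choices, but the abstract does not name the toolkit's metrics), and `type_aware_scores` is a hypothetical helper rather than a MissMecha function.

```python
import numpy as np


def type_aware_scores(truth, imputed, mask):
    """Score imputations per column on masked cells only.

    truth, imputed: dict mapping column name -> 1-D array-like of values.
    mask:           dict mapping column name -> boolean array (True = was missing).
    Numeric columns get RMSE; all other columns get accuracy (assumed metrics).
    """
    scores = {}
    for name, true_col in truth.items():
        true_col = np.asarray(true_col)
        pred_col = np.asarray(imputed[name])
        m = np.asarray(mask[name], dtype=bool)
        if not m.any():
            continue  # nothing was imputed in this column
        if true_col.dtype.kind in "iuf":
            # Numerical column: root mean squared error on imputed cells.
            err = true_col[m].astype(float) - pred_col[m].astype(float)
            scores[name] = ("rmse", float(np.sqrt(np.mean(err ** 2))))
        else:
            # Categorical column: share of imputed cells that match the truth.
            scores[name] = ("accuracy", float(np.mean(true_col[m] == pred_col[m])))
    return scores


truth = {"age": np.array([30.0, 40.0, 50.0, 60.0]),
         "city": np.array(["a", "b", "a", "c"])}
imputed = {"age": np.array([30.0, 43.0, 50.0, 64.0]),
           "city": np.array(["a", "b", "b", "c"])}
mask = {"age": np.array([False, True, False, True]),
        "city": np.array([False, False, True, True])}

print(type_aware_scores(truth, imputed, mask))
```

Restricting the score to masked cells avoids rewarding an imputer for values that were never missing.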