An Interdisciplinary and Cross-Task Review on Missing Data Imputation

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Missing data are pervasive in healthcare, bioinformatics, and social sciences, severely undermining analytical reliability and decision-making quality. This paper presents a systematic review of missing data imputation methods, proposing a unified classification framework that integrates statistical foundations with modern machine learning techniques—including regression, expectation-maximization (EM), matrix completion, autoencoders, generative adversarial networks (GANs), diffusion models, graph neural networks (GNNs), and large language models (LLMs). It is the first work to formally incorporate LLMs, diffusion models, and federated learning into the missing data paradigm, explicitly addressing privacy preservation and cross-domain generalization. The review synthesizes theoretical guarantees, standardized evaluation metrics, and publicly available benchmark resources. It further identifies key research directions, including model selection strategies, interpretability, and robust generalization. By unifying conceptual foundations and practical methodologies, this work provides a comprehensive, interdisciplinary reference for researchers and practitioners tackling missing data across diverse domains.
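The classical-versus-modern split described above can be made concrete with a small sketch. The example below contrasts single mean imputation with a regression-based iterative scheme, using scikit-learn as an assumed dependency (the paper does not prescribe any library); `IterativeImputer` is scikit-learn's MICE-style chained-regression imputer, not the authors' own method.

```python
import numpy as np
# enable_iterative_imputer must be imported to expose IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

# Toy matrix with two missing entries (np.nan)
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Single imputation: replace each missing entry with its column mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Regression-based imputation: each feature is iteratively regressed
# on the others (a MICE-style chained-equations scheme)
X_iter = IterativeImputer(random_state=0).fit_transform(X)

print(X_mean[3, 0])  # column-0 mean of observed values: (1+3+5)/3 = 3.0
```

Mean imputation ignores correlations between features, whereas the iterative imputer exploits them, which is why regression-family methods generally dominate under correlated data.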

📝 Abstract
Missing data is a fundamental challenge in data science, significantly hindering analysis and decision-making across a wide range of disciplines, including healthcare, bioinformatics, social science, e-commerce, and industrial monitoring. Despite decades of research and numerous imputation methods, the literature remains fragmented across fields, creating a critical need for a comprehensive synthesis that connects statistical foundations with modern machine learning advances. This work systematically reviews core concepts, including missingness mechanisms, single versus multiple imputation, and different imputation goals, and examines problem characteristics across various domains. It provides a thorough categorization of imputation methods, ranging from classical techniques (e.g., regression, the EM algorithm) to modern approaches such as low-rank and high-rank matrix completion, deep learning models (autoencoders, GANs, diffusion models, graph neural networks), and large language models. Special attention is given to methods for complex data types, such as tensors, time series, streaming data, graph-structured data, categorical data, and multimodal data. Beyond methodology, we investigate the crucial integration of imputation with downstream tasks such as classification, clustering, and anomaly detection, examining both sequential pipelines and joint optimization frameworks. The review also assesses theoretical guarantees, benchmarking resources, and evaluation metrics. Finally, we identify critical challenges and future directions, emphasizing model selection and hyperparameter optimization, the growing importance of privacy-preserving imputation via federated learning, and the pursuit of generalizable models that adapt across domains and data types, thereby outlining a roadmap for future research.
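The missingness mechanisms the abstract refers to are conventionally MCAR, MAR, and MNAR. A minimal simulation sketch, assuming NumPy and a single feature whose entries go missing (the variable names and masking probabilities are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(size=10_000)   # fully observed covariate
y = rng.normal(size=10_000)       # feature that will have missing values

# MCAR: every entry of y is missing with the same fixed probability
mcar = rng.random(10_000) < 0.3

# MAR: missingness in y depends only on the *observed* covariate x_obs
mar = rng.random(10_000) < np.where(x_obs > 0, 0.5, 0.1)

# MNAR: missingness in y depends on the unobserved value of y itself
mnar = rng.random(10_000) < np.where(y > 0, 0.5, 0.1)

print(round(mcar.mean(), 2))  # close to 0.30
```

The distinction matters in practice: most theoretical guarantees for imputation methods assume MCAR or MAR, while MNAR generally requires explicit modeling of the missingness process.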
Problem

Research questions and friction points this paper is trying to address.

Reviewing missing data imputation methods across diverse disciplines and tasks
Categorizing classical to modern techniques for complex data types
Investigating integration of imputation with downstream analytical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically reviews classical to modern imputation methods
Focuses on complex data types and downstream task integration
Identifies future directions like privacy-preserving and generalizable models