🤖 AI Summary
Missing-value imputation remains a central challenge in tabular data preprocessing: existing methods either lack accuracy or rely heavily on fitting and hyperparameter tuning, so no plug-and-play default exists. This paper introduces TabImpute, a zero-shot imputation method that requires no fitting or hyperparameter tuning at inference time. Building on the pre-trained transformer TabPFN, we propose an entry-wise featurization that achieves a ∼100× speedup over the previous TabPFN imputation method. We design a synthetic training data generation pipeline incorporating realistic missingness patterns and release MissBench, a comprehensive benchmark comprising 42 OpenML datasets and 13 missingness patterns. Pre-trained on this synthetic data and applied zero-shot, our method consistently outperforms 11 established imputation methods across diverse domains, including healthcare, finance, and engineering, achieving both high accuracy and millisecond-scale inference latency. To our knowledge, this is the first practical zero-shot paradigm for tabular imputation.
📝 Abstract
Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, due to huge variance in performance across real-world domains and time-consuming hyperparameter tuning, no default imputation method exists. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations requiring no fitting or hyperparameter tuning at inference-time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a $100\times$ speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluation of imputation methods with $42$ OpenML datasets and $13$ missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute's robust performance compared to $11$ established imputation methods.
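To make the evaluation setting concrete, the sketch below shows one way an imputation method can be scored: hide entries of a complete table, impute them, and measure the error on the hidden entries only. It is an illustration under stated assumptions, not the TabImpute model, the paper's synthetic data pipeline, or the MissBench protocol. The mask uses a single simple pattern (entries missing completely at random), the imputer is the "simple averaging" baseline the abstract mentions, and all function names (`mcar_mask`, `column_mean_impute`, `masked_rmse`) are hypothetical.

```python
import math
import random


def mcar_mask(n_rows, n_cols, p_missing, seed=0):
    """Boolean mask; True marks an entry hidden completely at random (assumed pattern)."""
    rng = random.Random(seed)
    return [[rng.random() < p_missing for _ in range(n_cols)] for _ in range(n_rows)]


def column_mean_impute(table, mask):
    """Fill each hidden entry with the mean of the observed entries in its column."""
    n_rows, n_cols = len(table), len(table[0])
    imputed = [row[:] for row in table]
    for j in range(n_cols):
        observed = [table[i][j] for i in range(n_rows) if not mask[i][j]]
        # Fall back to 0.0 if a column has no observed entries at all.
        fill = sum(observed) / len(observed) if observed else 0.0
        for i in range(n_rows):
            if mask[i][j]:
                imputed[i][j] = fill
    return imputed


def masked_rmse(truth, imputed, mask):
    """Root-mean-square error over the hidden entries only."""
    errors = [
        (truth[i][j] - imputed[i][j]) ** 2
        for i in range(len(truth))
        for j in range(len(truth[0]))
        if mask[i][j]
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


# Toy complete table: the second column is a noisy linear function of the first.
rng = random.Random(1)
table = []
for _ in range(200):
    x = rng.random()
    table.append([x, 2.0 * x + rng.gauss(0.0, 0.1)])

mask = mcar_mask(len(table), 2, p_missing=0.2)
imputed = column_mean_impute(table, mask)
print(f"masked RMSE of column-mean baseline: {masked_rmse(table, imputed, mask):.3f}")
```

Any other imputer can be swapped in for `column_mean_impute` and compared on the same mask. A benchmark in the spirit of MissBench would repeat this over many datasets and missingness patterns rather than a single toy table.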