🤖 AI Summary
Imputing missing values in real-world heterogeneous tabular data, which mixes numerical and categorical features, is difficult for two reasons: structural dependencies must be modeled, and heterogeneous features must be fused effectively. To address both, the authors propose IVGAE, a framework that constructs a bipartite graph between samples and features and employs a variational graph autoencoder to capture global structural dependencies. IVGAE introduces a dual-decoder architecture: one decoder reconstructs feature embeddings, while the other models missingness patterns and so supplies a missingness-aware structural prior. It also integrates a Transformer-based embedding module that encodes categorical variables without resorting to high-dimensional one-hot encoding. The evaluation covers 16 real-world datasets under three missingness mechanisms: MCAR, MAR, and MNAR. At 30% missingness, IVGAE achieves an average 12.7% reduction in RMSE and a 3.8% improvement in downstream task F1-score over state-of-the-art methods.
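To make the evaluation setup concrete, the following is a minimal sketch of an MCAR experiment at a 30% missing rate, scored by RMSE on the hidden entries only. It is not the IVGAE pipeline: column-mean imputation stands in for the model as a baseline, the data are synthetic, and the helper names `mcar_mask` and `masked_rmse` are hypothetical.

```python
import numpy as np

def mcar_mask(shape, missing_rate=0.3, seed=0):
    """Boolean mask, True where a value is hidden.

    Under MCAR every entry is dropped independently with the same
    probability, regardless of any observed or unobserved value.
    """
    rng = np.random.default_rng(seed)
    return rng.random(shape) < missing_rate

def masked_rmse(x_true, x_imputed, mask):
    """RMSE computed only over the entries that were hidden."""
    diff = (x_true - x_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic numerical table (assumption: stands in for a real dataset).
rng = np.random.default_rng(1)
X_true = rng.normal(size=(200, 5))

# Hide roughly 30% of the entries completely at random.
mask = mcar_mask(X_true.shape, missing_rate=0.3)
X_obs = np.where(mask, np.nan, X_true)

# Baseline imputer (NOT IVGAE): fill each gap with its column mean.
col_means = np.nanmean(X_obs, axis=0)
X_imp = np.where(mask, col_means, X_obs)

rmse = masked_rmse(X_true, X_imp, mask)
print(f"missing fraction: {mask.mean():.3f}, baseline RMSE: {rmse:.3f}")
```

MAR and MNAR masks would differ only in how the drop probability is computed: from observed columns for MAR, and from the hidden values themselves for MNAR.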
📝 Abstract
Handling missing data remains a fundamental challenge in real-world tabular datasets, especially when data are heterogeneous with both numerical and categorical features. Existing imputation methods often fail to capture complex structural dependencies and handle heterogeneous data effectively. We present IVGAE, a Variational Graph Autoencoder framework for robust imputation of incomplete heterogeneous data. IVGAE constructs a bipartite graph to represent sample-feature relationships and applies graph representation learning to model structural dependencies. A key innovation is its dual-decoder architecture, where one decoder reconstructs feature embeddings and the other models missingness patterns, providing structural priors aware of missing mechanisms. To better encode categorical variables, we introduce a Transformer-based heterogeneous embedding module that avoids high-dimensional one-hot encoding. Extensive experiments on 16 real-world datasets show that IVGAE achieves consistent improvements in RMSE and downstream F1 across MCAR, MAR, and MNAR missing scenarios under 30% missing rates. Code and data are available at: https://github.com/echoid/IVGAE.
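One plausible reading of the sample-feature bipartite graph is sketched below, under assumptions the abstract does not confirm: each row and each column becomes a node, and an edge carrying the cell value links sample i to feature j only where that cell is observed. The function name `build_bipartite_edges` and the node numbering are hypothetical; the actual construction is in the linked repository.

```python
import numpy as np

def build_bipartite_edges(X):
    """Build edges of a sample-feature bipartite graph from a table.

    Sample nodes are numbered 0..n-1 and feature nodes n..n+d-1.
    An edge (i, n + j) exists only where X[i, j] is observed (not NaN).
    Returns the 2 x E edge index, the E edge values, and the
    n x d observed-entry mask.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    observed = ~np.isnan(X)
    rows, cols = np.nonzero(observed)
    edge_index = np.stack([rows, cols + n])  # offset feature node ids by n
    edge_values = X[rows, cols]
    return edge_index, edge_values, observed

# Toy table: 2 samples, 3 features, 2 missing cells.
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, 5.0, 6.0]])
edge_index, edge_values, observed = build_bipartite_edges(X)
print(edge_index)
print(edge_values)
```

In this reading, the `observed` mask is the kind of target a missingness decoder could be trained to predict, while `edge_values` are what a reconstruction decoder would aim to recover.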