IVGAE: Handling Incomplete Heterogeneous Data with a Variational Graph Autoencoder

📅 2025-11-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Imputing missing values in real-world heterogeneous tabular data—comprising both numerical and categorical features—faces challenges in modeling structural dependencies and effectively fusing heterogeneous features. To address these, we propose IVGAE, a novel framework that constructs a bipartite graph between samples and features, and employs a variational graph autoencoder to capture global structural dependencies. IVGAE introduces a dual-decoder architecture: one reconstructs feature embeddings, while the other models missingness patterns, augmented by a missingness-aware structural prior. Furthermore, it integrates Transformer-based encoding to learn high-dimensional categorical embeddings without one-hot encoding. We conduct comprehensive evaluations across 16 real-world datasets under three missingness mechanisms—MCAR, MAR, and MNAR. At 30% missingness, IVGAE achieves an average 12.7% reduction in RMSE and a 3.8% improvement in downstream task F1-score over state-of-the-art methods.
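The three missingness mechanisms named above differ only in what the probability of a value being hidden depends on. The sketch below is an illustration under stated assumptions, not the paper's evaluation protocol: the column names, the median-threshold rule, and the mean-imputation baseline are all hypothetical choices made to keep the example small. It hides values in one numerical column under each mechanism and scores a trivial imputer with RMSE on the hidden entries.

```python
import math
import random
import statistics

def mcar_mask(n_rows, rate, rng):
    """MCAR: each entry is hidden with the same probability, independent of the data."""
    return [rng.random() < rate for _ in range(n_rows)]

def value_dependent_mask(driver, rate, rng):
    """Hide entries only in rows where `driver` exceeds its median.

    Rows above the median are hidden with probability min(1, 2 * rate),
    so roughly `rate` of all entries end up missing.
    """
    cutoff = statistics.median(driver)
    p_high = min(1.0, 2.0 * rate)
    return [v > cutoff and rng.random() < p_high for v in driver]

def rmse(truth, imputed, mask):
    """Root mean squared error over the hidden entries only."""
    errors = [(t - x) ** 2 for t, x, m in zip(truth, imputed, mask) if m]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0

rng = random.Random(0)
age = [23.0, 35.0, 41.0, 52.0, 60.0, 29.0]     # hypothetical fully observed column
income = [30.0, 48.0, 55.0, 70.0, 82.0, 39.0]  # hypothetical column that loses values

masks = {
    "MCAR": mcar_mask(len(income), 0.3, rng),
    "MAR": value_dependent_mask(age, 0.3, rng),      # driven by another, observed column
    "MNAR": value_dependent_mask(income, 0.3, rng),  # driven by the hidden values themselves
}

for name, mask in masks.items():
    observed = [v for v, m in zip(income, mask) if not m]
    fill = statistics.mean(observed)  # mean imputation as a trivial baseline
    imputed = [fill if m else v for v, m in zip(income, mask)]
    print(name, mask, round(rmse(income, imputed, mask), 3))
```

Under MNAR the baseline's fill value is computed from a sample that systematically lacks high values, which is one reason this mechanism is usually the hardest of the three.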

📝 Abstract

Handling missing data remains a fundamental challenge in real-world tabular datasets, especially when data are heterogeneous with both numerical and categorical features. Existing imputation methods often fail to capture complex structural dependencies and handle heterogeneous data effectively. We present IVGAE, a Variational Graph Autoencoder framework for robust imputation of incomplete heterogeneous data. IVGAE constructs a bipartite graph to represent sample-feature relationships and applies graph representation learning to model structural dependencies. A key innovation is its dual-decoder architecture, where one decoder reconstructs feature embeddings and the other models missingness patterns, providing structural priors aware of missing mechanisms. To better encode categorical variables, we introduce a Transformer-based heterogeneous embedding module that avoids high-dimensional one-hot encoding. Extensive experiments on 16 real-world datasets show that IVGAE achieves consistent improvements in RMSE and downstream F1 across MCAR, MAR, and MNAR missing scenarios under 30% missing rates. Code and data are available at: https://github.com/echoid/IVGAE.
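The bipartite sample-feature graph described in the abstract can be pictured with a small sketch. This is a minimal illustration, assuming that each observed cell becomes one edge between a sample node and a feature node and that a missing cell (marked `None` here) contributes no edge. The function names and the toy table are hypothetical and are not taken from the IVGAE repository.

```python
def build_bipartite_graph(table):
    """Split a table with missing cells into observed edges and missing pairs.

    Sample nodes are numbered by row index and feature nodes by column index.
    Each observed cell yields an edge (sample, feature, value); each missing
    cell yields a (sample, feature) pair that an imputer would need to fill.
    """
    edges, missing = [], []
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if value is None:
                missing.append((i, j))
            else:
                edges.append((i, j, value))
    return edges, missing

def feature_degrees(edges, n_features):
    """Count how many samples observe each feature (the feature-node degree)."""
    degrees = [0] * n_features
    for _, j, _ in edges:
        degrees[j] += 1
    return degrees

# Hypothetical heterogeneous table: numerical, categorical, numerical columns.
table = [
    [34.0, "red", None],
    [None, "blue", 1.5],
    [51.0, None, 2.0],
    [28.0, "red", 0.7],
]

edges, missing = build_bipartite_graph(table)
print(len(edges), "observed edges; cells to impute:", missing)
print("feature degrees:", feature_degrees(edges, n_features=3))
```

In this view, imputation amounts to predicting the value attached to an absent sample-feature pair from the structure of the edges that are present.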

Problem

Research questions and friction points this paper is trying to address.

Handling missing data in heterogeneous tabular datasets

Capturing structural dependencies for robust imputation

Encoding categorical variables without high-dimensional one-hot encoding
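To make the last point concrete, the sketch below contrasts one-hot encoding with an index-plus-lookup encoding for a single categorical column. It only covers the lookup step under assumed conventions (id 0 reserved for missing values, a randomly initialised table that a real model would train). It does not reproduce the paper's Transformer-based module, and all names are hypothetical.

```python
import random

def build_vocab(column):
    """Map each distinct category to an integer id; id 0 is reserved for missing."""
    vocab = {}
    for value in column:
        if value is not None and value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab

def encode(column, vocab):
    """Replace each category by its id; missing or unseen values map to 0."""
    return [vocab.get(value, 0) for value in column]

def init_embeddings(n_ids, dim, rng):
    """Random lookup table; in a trained model these rows would be learned."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_ids)]

def one_hot(idx, width):
    """One-hot vector whose width grows with the number of categories."""
    return [1.0 if k == idx else 0.0 for k in range(width)]

city = ["Paris", "Lima", None, "Paris", "Oslo"]  # hypothetical column
vocab = build_vocab(city)
ids = encode(city, vocab)

dim = 2  # embedding width is chosen freely, independent of cardinality
table = init_embeddings(len(vocab) + 1, dim, random.Random(0))
vectors = [table[i] for i in ids]

print("ids:", ids)
print("one-hot width:", len(one_hot(ids[0], len(vocab) + 1)))
print("embedding width:", len(vectors[0]))
```

The one-hot width is tied to the number of distinct categories, while the embedding width stays fixed as cardinality grows, which is the property the problem statement refers to.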

Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Graph Autoencoder for incomplete heterogeneous data

Dual-decoder architecture modeling feature embeddings and missingness patterns

Transformer-based embedding module avoiding high-dimensional one-hot encoding
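One way to picture the dual-decoder idea is as two heads reading the same latent embeddings: one predicts a cell's value, the other predicts whether the cell is observed. The sketch below is a toy illustration under strong assumptions. The weighted inner-product decoders, the squared-error and cross-entropy terms, and every name in it are hypothetical; the KL term of a variational model is omitted, and the paper's first decoder reconstructs feature embeddings rather than raw values. It should not be read as the IVGAE loss.

```python
import math

def weighted_dot(u, v, w):
    """Inner product of two embeddings with a per-dimension weight vector."""
    return sum(wk * uk * vk for wk, uk, vk in zip(w, u, v))

def decode_value(z_sample, z_feature, w_value):
    """Decoder 1 (assumed form): predict the numerical value of a cell."""
    return weighted_dot(z_sample, z_feature, w_value)

def decode_observed_prob(z_sample, z_feature, w_mask):
    """Decoder 2 (assumed form): probability that the cell is observed."""
    return 1.0 / (1.0 + math.exp(-weighted_dot(z_sample, z_feature, w_mask)))

def combined_loss(cells, z_samples, z_features, w_value, w_mask, mask_weight=1.0):
    """Squared error on observed cells plus cross-entropy on the missingness mask.

    `cells` holds (sample, feature, value) triples; value is None when missing.
    """
    value_loss, mask_loss = 0.0, 0.0
    for i, j, value in cells:
        u, v = z_samples[i], z_features[j]
        p = decode_observed_prob(u, v, w_mask)
        if value is None:
            mask_loss -= math.log(1.0 - p)  # cell is missing: want p close to 0
        else:
            mask_loss -= math.log(p)        # cell is observed: want p close to 1
            value_loss += (decode_value(u, v, w_value) - value) ** 2
    return value_loss + mask_weight * mask_loss

# Hypothetical latent embeddings for one sample node and two feature nodes.
z_samples = [[0.5, -0.2]]
z_features = [[1.0, 0.3], [0.1, 0.9]]
cells = [(0, 0, 0.4), (0, 1, None)]

print(round(combined_loss(cells, z_samples, z_features, [1.0, 1.0], [1.0, 1.0]), 4))
```

Sharing the embeddings between the two heads is what lets information about where values are missing shape the representation used to reconstruct them.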
