Impugan: Learning Conditional Generative Models for Robust Data Imputation

📅 2025-12-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In real-world settings, sensor failures, inconsistent records, and heterogeneity across multi-source data (differing scales, sampling rates, and quality levels) produce pervasive missing values that make it hard to combine data and build reliable models. Conventional imputation methods rely on linearity or independence assumptions and often yield biased or over-smoothed estimates. To address this, the paper proposes Impugan, a conditional Generative Adversarial Network (cGAN) for imputing missing values and integrating heterogeneous datasets: the generator reconstructs missing entries conditioned on observed features, while the discriminator enforces realism by distinguishing true from imputed data. This adversarial setup lets the model capture nonlinear and multimodal dependencies that conventional methods cannot represent. On benchmark datasets and a multi-source integration task, Impugan achieves up to 82% lower Earth Mover's Distance (EMD) and 70% lower mutual-information deviation than leading baselines, indicating better distributional fidelity of the imputed data.

📝 Abstract

Incomplete data are common in real-world applications. Sensors fail, records are inconsistent, and datasets collected from different sources often differ in scale, sampling rate, and quality. These differences create missing values that make it difficult to combine data and build reliable models. Standard imputation methods such as regression models, expectation-maximization, and multiple imputation rely on strong assumptions about linearity and independence. These assumptions rarely hold for complex or heterogeneous data, which can lead to biased or over-smoothed estimates. We propose Impugan, a conditional Generative Adversarial Network (cGAN) for imputing missing values and integrating heterogeneous datasets. The model is trained on complete samples to learn how missing variables depend on observed ones. During inference, the generator reconstructs missing entries from available features, and the discriminator enforces realism by distinguishing true from imputed data. This adversarial process allows Impugan to capture nonlinear and multimodal relationships that conventional methods cannot represent. In experiments on benchmark datasets and a multi-source integration task, Impugan achieves up to 82% lower Earth Mover's Distance (EMD) and 70% lower mutual-information deviation (MI) compared to leading baselines. These results show that adversarially trained generative models provide a scalable and principled approach for imputing and merging incomplete, heterogeneous data. Our model is available at: github.com/zalishmahmud/impuganBigData2025
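The EMD figure quoted above measures how far the distribution of imputed values sits from the distribution of true values. The following is a minimal sketch of that idea for a single numeric column, assuming equal-sized samples; it is not the paper's evaluation code. The function names `emd_1d` and `relative_reduction` and the toy data are hypothetical and chosen only for illustration.

```python
def emd_1d(real, imputed):
    """Empirical 1-D Earth Mover's Distance between two equal-sized samples.

    For equal-sized samples this equals the mean absolute difference
    between the sorted values of the two samples.
    """
    if len(real) != len(imputed):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(real), sorted(imputed)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` is lower than `baseline`."""
    return (baseline - candidate) / baseline


# Toy data (assumption): a bimodal column and two hypothetical imputations.
real = [1.0, 2.0, 8.0, 9.0]
mean_filled = [5.0, 5.0, 5.0, 5.0]   # mean imputation collapses both modes
mode_aware = [1.5, 2.5, 8.5, 9.5]    # an imputer that preserves both modes

emd_mean = emd_1d(real, mean_filled)
emd_aware = emd_1d(real, mode_aware)
print(emd_mean, emd_aware, relative_reduction(emd_mean, emd_aware))
```

The toy numbers only show why over-smoothed estimates score poorly on EMD: filling every gap with the mean erases the two modes, so more probability mass has to be moved to match the real distribution. They say nothing about the results reported in the paper.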

Problem

Research questions and friction points this paper is trying to address.

Robust imputation of missing values in incomplete datasets

Integration of heterogeneous data from multiple sources

Capturing nonlinear and multimodal relationships that conventional imputation methods miss

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional GAN for missing data imputation

Adversarial training captures nonlinear multimodal relationships

Scalable principled approach for heterogeneous data integration
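The conditional imputation idea listed above can be sketched as a data flow: hide entries of a complete sample to create a training condition, then at inference keep the observed entries and fill only the gaps from a generator. The sketch below is a plain-Python illustration under stated assumptions, not the Impugan implementation. The binary mask convention (1 = observed), the rule that observed entries are copied through unchanged, and all function names are assumptions; `toy_generator` is a hypothetical stand-in for a trained neural generator.

```python
import random


def mask_complete_row(row, missing_rate, rng):
    """Hide entries of a complete row; return (observed, mask), 1 = observed."""
    mask = [0 if rng.random() < missing_rate else 1 for _ in row]
    observed = [v if m else 0.0 for v, m in zip(row, mask)]
    return observed, mask


def impute(observed, mask, generator):
    """Keep observed entries and fill only the missing ones from the generator."""
    proposal = generator(observed, mask)
    return [o if m else g for o, m, g in zip(observed, mask, proposal)]


def toy_generator(observed, mask):
    """Hypothetical stand-in for a trained generator.

    It proposes the mean of the observed entries for every position;
    a real cGAN generator would be a learned network conditioned on
    the observed features.
    """
    seen = [o for o, m in zip(observed, mask) if m]
    fill = sum(seen) / len(seen) if seen else 0.0
    return [fill] * len(observed)


rng = random.Random(0)               # fixed seed for repeatability
row = [0.2, 0.4, 0.6, 0.8]           # a complete training sample (toy data)
observed, mask = mask_complete_row(row, 0.5, rng)
completed = impute(observed, mask, toy_generator)
print(mask, completed)
```

In an adversarial setup, a discriminator would then be trained to tell `row` from `completed`, and the generator would be updated to make that distinction harder. That training loop is omitted here because the summary gives no architectural details.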

🔎 Similar Papers

Learnable Prompt as Pseudo-Imputation: Rethinking the Necessity of Traditional EHR Data Imputation in Downstream Clinical Prediction