CtrTab: Tabular Data Synthesis with High-Dimensional and Limited Data

📅 2025-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion models suffer significant performance degradation—often underperforming non-diffusion baselines—on high-dimensional sparse tabular data. To address this, we propose CtrTab, a conditional controlled diffusion framework. Its core innovation is the first use of Laplacian noise injection as an explicit controllable diffusion condition; we theoretically prove this mechanism induces implicit ℓ² regularization, effectively mitigating distributional modeling bias in small-sample, high-dimensional regimes. CtrTab jointly integrates conditional diffusion modeling, explicit noise scheduling, and domain-aware tabular feature engineering. Evaluated across multiple benchmark datasets, it surpasses existing state-of-the-art methods, achieving an average accuracy improvement exceeding 80%. Notably, in high-dimensional, low-data scenarios, CtrTab substantially enhances both generation stability and fidelity.

Technology Category

Application Category

📝 Abstract
Diffusion-based tabular data synthesis models have yielded promising results. However, we observe that when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To address this issue, we propose CtrTab-a condition controlled diffusion model for tabular data synthesis-to improve the performance of diffusion-based generative models in high-dimensional, low-data scenarios. Through CtrTab, we inject samples with added Laplace noise as control signals to improve data diversity and show its resemblance to L2 regularization, which enhances model robustness. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with performance gap in accuracy over 80% on average. Our source code will be released upon paper publication.
Problem

Research questions and friction points this paper is trying to address.

Improves diffusion-based models for high-dimensional tabular data synthesis
Addresses performance degradation in high-dimensional, low-data scenarios
Enhances model robustness and data diversity with Laplace noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Condition-controlled diffusion model for tabular data synthesis
Injects Laplace noise to enhance data diversity
Improves robustness in high-dimensional, low-data scenarios
🔎 Similar Papers
No similar papers found.