🤖 AI Summary
Diffusion-based tabular data synthesis models degrade as data dimensionality grows, sometimes underperforming simpler non-diffusion baselines, because limited training samples in high-dimensional space hinder accurate modeling of the data distribution. To address this, the authors propose CtrTab, a condition-controlled diffusion model for tabular data synthesis. Its core idea is to inject samples perturbed with Laplace noise as control signals, which improves data diversity; the authors further show that this mechanism resembles L2 regularization, enhancing model robustness. Evaluated across multiple benchmark datasets, CtrTab outperforms state-of-the-art models, with an accuracy gap of over 80% on average, and is particularly effective in high-dimensional, low-data scenarios.
📝 Abstract
Diffusion-based tabular data synthesis models have yielded promising results. However, we observe that as data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the data distribution accurately. To address this issue, we propose CtrTab, a condition-controlled diffusion model for tabular data synthesis, which improves the performance of diffusion-based generative models in high-dimensional, low-data scenarios. In CtrTab, we inject samples with added Laplace noise as control signals to improve data diversity, and we show that this mechanism resembles L2 regularization, which enhances model robustness. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with an accuracy gap of over 80% on average. Our source code will be released upon paper publication.
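The control-signal construction described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the noise scale, and the use of NumPy are assumptions made here for clarity.

```python
import numpy as np

def make_control_signal(x, scale=0.1, rng=None):
    """Perturb a data sample with zero-mean Laplace noise to form a
    control signal, as described in the abstract.

    The noise scale (0.1) is a hypothetical choice for illustration;
    the paper's actual schedule and parameters may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.laplace(loc=0.0, scale=scale, size=x.shape)
    return x + noise

# Toy usage: a conditional denoiser would receive the noisy latent
# concatenated with the control signal along the feature axis.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))           # 4 samples, 16 tabular features
control = make_control_signal(x, scale=0.1, rng=rng)
conditioned_input = np.concatenate([x, control], axis=1)
print(conditioned_input.shape)          # (4, 32)
```

The concatenation shown at the end is one common way to feed a control signal into a conditional model; the paper may use a different conditioning mechanism.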