🤖 AI Summary
To address key challenges in generating heterogeneous tabular data, namely complex column-wise distributions, hard-to-model inter-column dependencies, and missing-value imputation, this paper introduces TabDiff, a unified continuous-time diffusion framework tailored for mixed-type (numerical and categorical) tabular data. Methodologically, the authors propose feature-wise learnable diffusion processes coupled with a mixed-type stochastic sampler, and enable conditional generation via classifier-free guidance. The end-to-end architecture uses a Transformer backbone to jointly model continuous-time dynamics over both feature types, with self-correcting stochastic sampling to reduce accumulated decoding error. Evaluated on seven benchmark datasets across eight metrics, the approach consistently outperforms state-of-the-art methods, improving pair-wise column correlation estimation by up to 22.5% and enhancing the statistical fidelity and practical utility of generated tabular data.
📝 Abstract
Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a mixed-type stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across all eight metrics, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations. Code is available at https://github.com/MinkaiXu/TabDiff.
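The abstract's core idea, a joint forward-diffusion process that noises numerical and categorical features with per-feature schedules, can be illustrated with a minimal sketch. This is not TabDiff's actual implementation or API: the function name, the fixed linear schedules (`sigmas`, `cat_rates` as stand-ins for the paper's learnable schedules), and the absorbing `MASK` state for categorical entries are all illustrative assumptions.

```python
import numpy as np

MASK = -1  # illustrative absorbing "mask" state for categorical codes


def forward_diffuse(x_num, x_cat, t, sigmas, cat_rates, rng):
    """Illustrative forward-noising at time t in [0, 1] for mixed-type rows.

    x_num:     (n, d_num) float array of numerical features.
    x_cat:     (n, d_cat) int array of categorical codes.
    sigmas:    (d_num,) per-feature Gaussian noise scales (stand-in for a
               learnable numerical schedule).
    cat_rates: (d_cat,) per-feature absorbing rates (stand-in for a
               learnable categorical schedule).
    """
    # Numerical features: Gaussian perturbation whose scale grows with t,
    # independently per feature (feature-wise schedule).
    noise = rng.standard_normal(x_num.shape)
    x_num_t = x_num + t * sigmas * noise

    # Categorical features: each entry is independently absorbed into the
    # MASK state with a probability that grows with t, again per feature.
    absorb = rng.random(x_cat.shape) < t * cat_rates
    x_cat_t = np.where(absorb, MASK, x_cat)
    return x_num_t, x_cat_t
```

At `t = 0` the data is returned unchanged, and at `t = 1` (with unit rates) every categorical entry is fully absorbed, mirroring how a continuous-time diffusion interpolates between clean data and pure noise. A trained denoiser would run this process in reverse during sampling.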