TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches struggle to effectively model heterogeneous tabular data—comprising free-text, categorical, and numerical fields—within a unified framework: diffusion models often produce low-quality text, while large language models lack precision in handling numerical values. This work proposes TabDLM, the first masked diffusion language model that incorporates learnable, dedicated numerical embeddings to represent continuous numerical features through a diffusion process, while simultaneously modeling textual and categorical attributes via masked diffusion. A bidirectional cross-modal attention mechanism enables coherent joint generation across modalities. Experimental results demonstrate that TabDLM significantly outperforms current diffusion-based and large language model baselines on multiple heterogeneous tabular datasets, achieving both fluent text generation and high numerical fidelity.

📝 Abstract
Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical-language diffusion model built on masked diffusion language models (MDLMs). TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process through learned, specialized numeric token embeddings; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
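The core idea of the abstract, corrupting the two modalities with two different forward processes inside one model, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the linear masking schedule, and the variance-preserving Gaussian noise schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(text_tokens, num_values, t, mask_id=-1):
    """Hedged sketch of TabDLM-style joint corruption at noise level t in [0, 1]:
    text/categorical token ids are masked with probability t (masked diffusion),
    while numerical features receive Gaussian noise scaled by t (continuous
    diffusion). A denoiser with bidirectional attention would then be trained
    to recover both modalities jointly from the corrupted row."""
    text = np.asarray(text_tokens).copy()
    masked = rng.random(text.shape) < t
    text[masked] = mask_id                   # discrete: replace with a [MASK] id
    nums = np.asarray(num_values, dtype=float)
    # continuous: variance-preserving interpolation toward pure noise at t = 1
    noisy = np.sqrt(1.0 - t) * nums + np.sqrt(t) * rng.standard_normal(nums.shape)
    return text, noisy

tokens = np.array([101, 7, 42, 9])           # toy token ids for a text field
values = np.array([3.2, 150.0])              # toy numerical features
corrupted_tokens, corrupted_values = forward_diffuse(tokens, values, t=0.5)
```

At t = 0 the row is returned unchanged, and at t = 1 every token is masked and the numeric channel is pure noise; sampling would reverse this corruption step by step for both channels at once.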
Problem

Research questions and friction points this paper is trying to address.

tabular data generation
free-form text
numerical features
heterogeneous data
joint modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

tabular data generation
joint numerical-language diffusion
masked diffusion language model
cross-modality interaction
synthetic data