Why LLMs Are Bad at Synthetic Table Generation (and what to do about it)

📅 2024-06-20

📈 Citations: 5

✨ Influential: 0

career value

173K/year

🤖 AI Summary

This work identifies the fundamental cause of poor performance by large language models (LLMs) in synthetic tabular data generation: their autoregressive modeling paradigm—combined with random sequence permutations during fine-tuning—leads to loss of functional dependencies and failure in correctly modeling conditional distributions, exacerbated by the absence of explicit structural awareness of tabular schemas. To address this, the authors propose the first “permutation-aware” modeling paradigm, integrating permutation-invariant representation learning, structured prompt engineering, explicit conditional distribution modeling, and schema-guided serialization. Extensive experiments demonstrate that the method substantially improves statistical fidelity, functional dependency accuracy, and constraint satisfaction rate of generated tables. It outperforms state-of-the-art baselines by over 23% across multiple real-world business datasets, establishing new benchmarks for LLM-based tabular data synthesis.

Technology Category

Application Category

📝 Abstract

Synthetic data generation is integral to ML pipelines, e.g., to augment training data, replace sensitive information, and even to power advanced platforms like DeepSeek. While LLMs fine-tuned for synthetic data generation are gaining traction, synthetic table generation -- a critical data type in business and science -- remains under-explored compared to text and image synthesis. This paper shows that LLMs, whether used as-is or after traditional fine-tuning, are inadequate for generating synthetic tables. Their autoregressive nature, combined with random order permutation during fine-tuning, hampers the modeling of functional dependencies and prevents capturing conditional mixtures of distributions essential for real-world constraints. We demonstrate that making LLMs permutation-aware can mitigate these issues.

Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with synthetic table generation due to autoregressive limitations.

Traditional fine-tuning fails to model functional dependencies in synthetic tables.

Permutation-aware LLMs improve synthetic table generation for real-world constraints.

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs inadequate for synthetic table generation

Autoregressive nature hampers functional dependencies

Permutation-aware LLMs mitigate generation issues

🔎 Similar Papers

On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation