TabuLa: Harnessing Language Models for Tabular Data Synthesis

📅 2023-10-19
🏛️ arXiv.org
📈 Citations: 13
Influential: 4
🤖 AI Summary
Existing LLM-based tabular data synthesizers suffer from protracted training times and limited reusability, failing to meet the efficiency demands of privacy-sensitive applications. To address these limitations, the authors propose Tabula, a lightweight LLM framework designed specifically for tabular data synthesis. Tabula discards the pre-trained weights originally designed for natural language tasks, introduces a token sequence compression strategy to accelerate training, and adds a novel token padding method that improves sequence alignment across training batches. Evaluated on six benchmark datasets, Tabula reduces average per-epoch training time by 46.2% while achieving higher machine learning utility than state-of-the-art methods. Notably, a Tabula model trained on tabular datasets serves effectively as a foundation model for synthesizing new tabular datasets.
📝 Abstract
Tabular data synthesis is crucial for addressing privacy and security concerns in industries reliant on tabular data. While recent advancements adopt large language models (LLMs) for realistic tabular data generation, their long training times and limited reusability hinder practical applications. In this paper, we propose Tabula, a tabular data synthesizer that leverages the structure of LLMs. Unlike state-of-the-art (SOTA) LLM-based tabular data synthesizers that rely on pre-trained LLMs, Tabula discards the pre-trained weights originally designed for natural language tasks, focusing instead on a tailored approach for tabular data. In addition, Tabula introduces a token sequence compression strategy that significantly reduces training time while maintaining data quality, alongside a novel token padding method that improves sequence alignment across training batches. Experiments on six datasets show that Tabula achieves superior synthetic data utility compared to current SOTA methods. Additionally, the results demonstrate that a Tabula model trained on tabular datasets serves effectively as a foundation model for synthesizing new tabular datasets. Furthermore, the proposed padding method outperforms the conventional left and right padding strategies. Finally, the results highlight that Tabula reduces training time per epoch by an average of 46.2% compared to state-of-the-art LLM approaches while achieving higher data utility. Our code is available at https://github.com/zhao-zilong/Tabula
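To make the abstract's two core ideas concrete, here is a minimal illustrative sketch (not the paper's exact scheme; the function names, the "column is value" template, and the example column names are assumptions for illustration): serializing a table row into a text sequence for an LLM, a compressed variant that drops repeated column-name tokens to shorten sequences, and simple padding to align token sequences within a training batch.

```python
# Illustrative sketch of LLM-based tabular synthesis ingredients; all names
# and formats here are hypothetical, not taken from the Tabula codebase.

def serialize_row(row: dict) -> str:
    """Verbose serialization: repeats column names in every row."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

def serialize_row_compressed(row: dict) -> str:
    """Compressed serialization: values only, relying on a fixed column order,
    so each training sequence carries far fewer tokens."""
    return ",".join(str(val) for val in row.values())

def pad_batch(sequences: list, pad_id: int = 0) -> list:
    """Pad token-id sequences in a batch to equal length (right padding
    shown here; the paper proposes an improved padding placement)."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

# Hypothetical row from a census-style dataset.
row = {"age": 39, "workclass": "State-gov", "education": "Bachelors"}
print(serialize_row(row))             # age is 39, workclass is State-gov, education is Bachelors
print(serialize_row_compressed(row))  # 39,State-gov,Bachelors
print(pad_batch([[5, 7, 9], [5, 7]]))
```

The compressed form shortens every training sequence, which is the intuition behind the reported per-epoch training-time reduction; the padding helper shows the batch-alignment problem that the paper's novel padding method addresses.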
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Table Data Synthesis
Privacy Protection
Innovation

Methods, ideas, or system contributions that make the work stand out.

TabuLa
High-quality Tabular Data Synthesis
Language Model Architecture