Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0

career value

207K/year

🤖 AI Summary

Prior work on table-related tasks focuses predominantly on table question answering and reasoning, largely neglecting table generation—and typically relies on large-scale computational resources and data. Method: This paper pioneers instruction-tuning of large language models (LLMs) for table generation under low-resource constraints: only 7K high-quality instructions, a single A100 GPU, and six hours of training time. We introduce the first high-fidelity, structure-aware instruction dataset for table generation and fine-tune Llama3.1-8B-Instruct to explicitly model header semantics, row-column relationships, and inter-cell semantic consistency. Contribution/Results: Our approach achieves performance on par with GPT-4o across multiple benchmarks while reducing training cost by over 90%. It establishes a new paradigm for lightweight, high-fidelity table generation—demonstrating that effective structural modeling need not require massive resources.

Technology Category

Application Category

📝 Abstract

Tabular instruction tuning has emerged as a promising research direction for improving LLMs understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU, for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.

Problem

Research questions and friction points this paper is trying to address.

Improving LLMs tabular data generation through instruction tuning

Addressing tabular instruction tuning with limited data resources

Enhancing tabular data generation using minimal computational resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction tuning for tabular data generation

High-quality dataset enables efficient LLM comprehension

Achieves GPT-4o performance with limited resources

🔎 Similar Papers

Why LLMs Are Bad at Synthetic Table Generation (and what to do about it)