TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing table instruction-tuning data synthesis methods suffer from two key limitations: insufficient exploration of the input space—leading to low diversity—and neglect of target models’ specific weaknesses in table understanding, resulting in poor data efficiency. This paper proposes a weakness-guided progressive data synthesis framework, introducing the first closed-loop paradigm of “weakness identification → iterative input-space sampling → joint instruction-table generation,” enabling self-bootstrapping, high-quality, and efficient synthesis. Our approach integrates multi-stage generation with GPT-4o, quantitative LLM weakness assessment, and progressive input-space exploration. Evaluated on ten diverse table reasoning benchmarks, our method achieves an average accuracy improvement of 11.62 percentage points (from 49.07% to 60.69%) for Llama3.1-8B-Instruct using only 27K synthetic samples—outperforming state-of-the-art methods that rely on substantially larger datasets.
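The closed loop described above (weakness identification → iterative input-space sampling → joint instruction-table generation) can be sketched as a simple control loop. This is a minimal illustrative sketch, not the paper's implementation: every function name here (`synthesize_seed_data`, `target_llm_accuracy`, `expand_around`) is a hypothetical stand-in for a GPT-4o generation call or a target-LLM evaluation, and the accuracy scores are mocked with random numbers.

```python
import random

random.seed(0)  # deterministic mock scores for illustration

def synthesize_seed_data(n):
    # Stand-in for GPT-4o synthesizing diverse (table, instruction) seed pairs.
    return [{"table_topic": f"topic_{i}", "instruction": f"question_{i}"}
            for i in range(n)]

def target_llm_accuracy(sample):
    # Mocked evaluation of the target LLM on one synthetic sample; the real
    # framework would run the model and score its answer quantitatively.
    return random.random()

def expand_around(sample, k):
    # Stand-in for jointly generating k new instruction-table variants
    # in the input-space neighborhood of an identified weakness sample.
    return [{"table_topic": sample["table_topic"],
             "instruction": f'{sample["instruction"]}_v{j}'}
            for j in range(k)]

def weakness_guided_synthesis(n_seed=20, rounds=3, threshold=0.5, k=2):
    frontier = synthesize_seed_data(n_seed)
    training_data = []
    for _ in range(rounds):
        # Weakness identification: keep samples the target LLM handles poorly.
        weaknesses = [s for s in frontier if target_llm_accuracy(s) < threshold]
        training_data.extend(weaknesses)
        if not weaknesses:
            break
        # Progressive exploration: sample new data around each weakness,
        # which becomes the frontier for the next iteration.
        frontier = [v for w in weaknesses for v in expand_around(w, k)]
    # training_data is what would be used to fine-tune the target LLM.
    return training_data
```

The accumulated weakness data, rather than the full generated pool, serves as the fine-tuning set, which is how the framework trades raw data quantity for data efficiency.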

📝 Abstract
Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in the target LLM's table understanding ability and blindly pursue an increase in data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of newly identified weakness data, which eventually serves as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-Instruct by 11.62 percentage points (from 49.07% to 60.69%) with 27K GPT-4o synthetic samples and outperforms state-of-the-art data synthesis baselines that use more training data. The code and data are available at https://github.com/SpursGoZmy/TableDreamer
Problem

Research questions and friction points this paper is trying to address.

Limited data diversity in table instruction synthesis
Suboptimal data efficiency due to ignored LLM weaknesses
Inadequate exploration of table understanding input space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive weakness-guided data synthesis
Diverse table and instruction seed data
Iterative input space exploration
Mingyu Zheng
Institute of Information Engineering, CAS
NLP · Table Understanding · LLMs
Zhifan Feng
Baidu Inc, Beijing, China
Jia Wang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Lanrui Wang
Institute of Information Engineering, Chinese Academy of Sciences
NLP · Dialogue Generation · LLMs
Zheng Lin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Yang Hao
Baidu Inc, Beijing, China
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network · Network Security