TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing table instruction-tuning data synthesis methods suffer from two key limitations: insufficient exploration of the input space—leading to low diversity—and neglect of target models’ specific weaknesses in table understanding, resulting in poor data efficiency. This paper proposes a weakness-guided progressive data synthesis framework, introducing the first closed-loop paradigm of “weakness identification → iterative input-space sampling → joint instruction-table generation,” enabling self-bootstrapping, high-quality, and efficient synthesis. Our approach integrates multi-stage generation with GPT-4o, quantitative LLM weakness assessment, and progressive input-space exploration. Evaluated on ten diverse table reasoning benchmarks, our method achieves an average accuracy improvement of 11.62 percentage points (from 49.07% to 60.69%) for Llama3.1-8B-Instruct using only 27K synthetic samples—outperforming state-of-the-art methods that rely on substantially larger datasets.
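The closed loop described above (weakness identification → iterative input-space sampling → joint instruction-table generation) can be sketched as a simple control loop. This is a minimal illustrative sketch, not the paper's implementation: every function name here (`synthesize_seed_data`, `target_llm_accuracy`, `expand_around`) is a hypothetical stand-in for a GPT-4o generation call or a target-LLM evaluation, and the accuracy scores are mocked with random numbers.

```python
import random

random.seed(0)  # deterministic mock scores for illustration

def synthesize_seed_data(n):
    # Stand-in for GPT-4o synthesizing diverse (table, instruction) seed pairs.
    return [{"table_topic": f"topic_{i}", "instruction": f"question_{i}"}
            for i in range(n)]

def target_llm_accuracy(sample):
    # Mocked evaluation of the target LLM on one synthetic sample; the real
    # framework would run the model and score its answer quantitatively.
    return random.random()

def expand_around(sample, k):
    # Stand-in for jointly generating k new instruction-table variants
    # in the input-space neighborhood of an identified weakness sample.
    return [{"table_topic": sample["table_topic"],
             "instruction": f'{sample["instruction"]}_v{j}'}
            for j in range(k)]

def weakness_guided_synthesis(n_seed=20, rounds=3, threshold=0.5, k=2):
    frontier = synthesize_seed_data(n_seed)
    training_data = []
    for _ in range(rounds):
        # Weakness identification: keep samples the target LLM handles poorly.
        weaknesses = [s for s in frontier if target_llm_accuracy(s) < threshold]
        training_data.extend(weaknesses)
        if not weaknesses:
            break
        # Progressive exploration: sample new data around each weakness,
        # which becomes the frontier for the next iteration.
        frontier = [v for w in weaknesses for v in expand_around(w, k)]
    # training_data is what would be used to fine-tune the target LLM.
    return training_data
```

The accumulated weakness data, rather than the full generated pool, serves as the fine-tuning set, which is how the framework trades raw data quantity for data efficiency.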

📝 Abstract
Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in the target LLM's table understanding ability and blindly pursue an increase in data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of newly identified weakness data, which eventually serves as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-Instruct by 11.62 percentage points (from 49.07% to 60.69%) with 27K GPT-4o synthetic samples and outperforms state-of-the-art data synthesis baselines that use more training data. The code and data are available at https://github.com/SpursGoZmy/TableDreamer
Problem

Research questions and friction points this paper is trying to address.

Limited data diversity in table instruction synthesis
Suboptimal data efficiency due to ignored LLM weaknesses
Inadequate exploration of table understanding input space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive weakness-guided data synthesis
Diverse table and instruction seed data
Iterative input space exploration
Mingyu Zheng
Institute of Information Engineering, CAS
NLP · Table Understanding · LLMs
Zhifan Feng
Baidu Inc, Beijing, China
Jia Wang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Lanrui Wang
Institute of Information Engineering, Chinese Academy of Sciences
NLP · Dialogue Generation · LLMs
Zheng Lin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Yang Hao
Baidu Inc, Beijing, China
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network · Network Security