A Survey on Data Selection for LLM Instruction Tuning

📅 2024-02-04
🏛️ arXiv.org
📈 Citations: 48
Influential: 0
🤖 AI Summary
This survey addresses the challenge of efficiently selecting high-quality data subsets for instruction tuning, with the goal of enhancing LLM performance while reducing training costs. It systematically reviews mainstream instruction datasets and proposes, for the first time, a taxonomy of data selection methodologies designed specifically for LLM instruction tuning, reflecting a shift from a quantity-driven to a quality-driven paradigm. The taxonomy groups selection strategies into four classes: model-based feedback, uncertainty estimation, diversity optimization, and instruction complexity modeling. The survey also compiles downstream evaluation protocols, including AlpacaEval and MT-Bench, to enable consistent, multi-dimensional assessment. Drawing on results reported across more than 30 selection methods, it finds that retaining only 10–30% of high-quality samples can match full-dataset tuning, while highlighting open issues such as inconsistent evaluation across methods.
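To make the quality-driven idea above concrete, here is a minimal sketch of a top-fraction selection loop: score each instruction-response pair and keep only the highest-scoring portion of the pool. The scoring heuristic, class names, and the `keep` threshold below are illustrative assumptions, not the procedure of any specific surveyed method, which would typically rely on a reward model or an LLM judge rather than a lexical heuristic.

```python
# Hypothetical sketch of quality-driven data selection:
# score each example, then keep only the top fraction of the pool.
from dataclasses import dataclass

@dataclass
class Example:
    instruction: str
    response: str

def quality_score(ex: Example) -> float:
    """Stand-in for a model-based quality score (e.g. a reward model or an
    LLM-judge rating); here a crude lexical-diversity/length heuristic."""
    tokens = (ex.instruction + " " + ex.response).split()
    return len(set(tokens)) / max(len(tokens), 1) * min(len(tokens), 512)

def select_top_fraction(pool: list[Example], keep: float = 0.2) -> list[Example]:
    """Rank the pool by score and keep the top `keep` fraction."""
    ranked = sorted(pool, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]

if __name__ == "__main__":
    pool = [
        Example("Explain overfitting.", "Overfitting means a model memorizes noise instead of learning the signal ..."),
        Example("hi", "hello"),
        Example("Summarize the water cycle.", "Evaporation, condensation, precipitation, and collection."),
    ]
    for ex in select_top_fraction(pool, keep=0.34):
        print(ex.instruction)
```

In the setting surveyed above, setting `keep` between 0.1 and 0.3 corresponds to the 10–30% retention rates reported for many selection methods.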

📝 Abstract
Instruction tuning is a vital step in training large language models (LLMs), so how to enhance the effectiveness of instruction tuning has received increasing attention. Existing works indicate that the quality of the dataset is more crucial than its quantity during instruction tuning of LLMs. Therefore, many recent studies focus on methods for selecting high-quality subsets from instruction datasets, aiming to reduce training costs and enhance the instruction-following capabilities of LLMs. This paper presents a comprehensive survey on data selection for LLM instruction tuning. First, we introduce the widely used instruction datasets. Then, we propose a new taxonomy of data selection methods and provide a detailed introduction to recent advances; the evaluation strategies and results of data selection methods are also elaborated in detail. Finally, we emphasize the open challenges and present new frontiers of this task.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM instruction tuning effectiveness through data selection
Reducing training costs by selecting high-quality instruction datasets
Improving LLM instruction-following capabilities via optimized data subsets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Survey on high-quality data selection methods
Taxonomy of instruction data selection techniques (one class is sketched after this list)
Evaluation strategies for data selection effectiveness
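To complement the quality-scoring sketch above, the diversity-optimization class of the taxonomy can be illustrated with a simple coverage heuristic. The sketch below is a hypothetical k-center-greedy selection over toy bag-of-words vectors; real methods in this class typically operate on sentence embeddings, so the featurization and distance here are assumptions made only for illustration.

```python
# Hypothetical sketch of diversity-oriented selection: k-center greedy
# over toy bag-of-words vectors (real methods would use embeddings).
import math
from collections import Counter

def featurize(text: str) -> Counter:
    # Toy bag-of-words featurization (an assumption for this sketch).
    return Counter(text.lower().split())

def distance(a: Counter, b: Counter) -> float:
    # Euclidean distance between sparse bag-of-words vectors.
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def k_center_greedy(texts: list[str], k: int) -> list[int]:
    """Greedily pick k indices so the selected set covers the pool broadly."""
    feats = [featurize(t) for t in texts]
    selected = [0]  # seed with the first example
    min_dist = [distance(f, feats[0]) for f in feats]
    while len(selected) < min(k, len(texts)):
        nxt = max(range(len(texts)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, distance(f, feats[nxt])) for d, f in zip(min_dist, feats)]
    return selected

if __name__ == "__main__":
    pool = [
        "Translate to French: good morning",
        "Translate to French: good night",
        "Write a haiku about autumn leaves",
        "Solve 12 * 7 step by step",
    ]
    print(k_center_greedy(pool, k=2))  # e.g. [0, 3]: a small but diverse subset
```

The greedy step repeatedly adds the example farthest from everything already selected, which is one common way to trade a little per-example quality for broader instruction coverage.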
Authors
Jiahao Wang (Harbin Institute of Technology; Institute of Automation, Chinese Academy of Sciences)
Bolin Zhang (Harbin Institute of Technology)
Qianlong Du (Institute of Automation, Chinese Academy of Sciences)
Jiajun Zhang (Institute of Automation, Chinese Academy of Sciences; research areas: Natural Language Processing, Large Language Models, Multimodal Information Processing)
Dianhui Chu (Harbin Institute of Technology)