A Survey on Data Selection for LLM Instruction Tuning

📅 2024-02-04
🏛️ arXiv.org
📈 Citations: 48
Influential: 0
🤖 AI Summary
This survey addresses the challenge of efficiently selecting high-quality data subsets for instruction tuning, with the goal of enhancing LLM performance while reducing training costs. It systematically reviews mainstream instruction datasets and proposes, for the first time, a taxonomy of data selection methodologies designed specifically for LLM instruction tuning, reflecting a shift from a quantity-driven to a quality-driven paradigm. The taxonomy groups selection strategies into four classes: model-based feedback, uncertainty estimation, diversity optimization, and instruction complexity modeling. The survey also compiles downstream evaluation protocols, including AlpacaEval and MT-Bench, to enable consistent, multi-dimensional assessment. Drawing on results reported across more than 30 selection methods, it finds that retaining only 10–30% of high-quality samples can match full-dataset tuning, while highlighting open issues such as inconsistent evaluation across methods.
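To make the quality-driven idea above concrete, here is a minimal sketch of a top-fraction selection loop: score each instruction-response pair and keep only the highest-scoring portion of the pool. The scoring heuristic, class names, and the `keep` threshold below are illustrative assumptions, not the procedure of any specific surveyed method, which would typically rely on a reward model or an LLM judge rather than a lexical heuristic.

```python
# Hypothetical sketch of quality-driven data selection:
# score each example, then keep only the top fraction of the pool.
from dataclasses import dataclass

@dataclass
class Example:
    instruction: str
    response: str

def quality_score(ex: Example) -> float:
    """Stand-in for a model-based quality score (e.g. a reward model or an
    LLM-judge rating); here a crude lexical-diversity/length heuristic."""
    tokens = (ex.instruction + " " + ex.response).split()
    return len(set(tokens)) / max(len(tokens), 1) * min(len(tokens), 512)

def select_top_fraction(pool: list[Example], keep: float = 0.2) -> list[Example]:
    """Rank the pool by score and keep the top `keep` fraction."""
    ranked = sorted(pool, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]

if __name__ == "__main__":
    pool = [
        Example("Explain overfitting.", "Overfitting means a model memorizes noise instead of learning the signal ..."),
        Example("hi", "hello"),
        Example("Summarize the water cycle.", "Evaporation, condensation, precipitation, and collection."),
    ]
    for ex in select_top_fraction(pool, keep=0.34):
        print(ex.instruction)
```

In the setting surveyed above, setting `keep` between 0.1 and 0.3 corresponds to the 10–30% retention rates reported for many selection methods.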

📝 Abstract
Instruction tuning is a vital step in training large language models (LLMs), so how to enhance the effectiveness of instruction tuning has received increasing attention. Existing works indicate that the quality of the dataset is more crucial than its quantity during instruction tuning of LLMs. Therefore, many recent studies focus on methods for selecting high-quality subsets from instruction datasets, aiming to reduce training costs and enhance the instruction-following capabilities of LLMs. This paper presents a comprehensive survey on data selection for LLM instruction tuning. First, we introduce the widely used instruction datasets. Then, we propose a new taxonomy of data selection methods and provide a detailed introduction to recent advances; the evaluation strategies and results of data selection methods are also elaborated in detail. Finally, we emphasize the open challenges and present new frontiers of this task.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM instruction tuning effectiveness through data selection
Reducing training costs by selecting high-quality instruction datasets
Improving LLM instruction-following capabilities via optimized data subsets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Survey on high-quality data selection methods
Taxonomy of instruction data selection techniques (one class is sketched after this list)
Evaluation strategies for data selection effectiveness
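To complement the quality-scoring sketch above, the diversity-optimization class of the taxonomy can be illustrated with a simple coverage heuristic. The sketch below is a hypothetical k-center-greedy selection over toy bag-of-words vectors; real methods in this class typically operate on sentence embeddings, so the featurization and distance here are assumptions made only for illustration.

```python
# Hypothetical sketch of diversity-oriented selection: k-center greedy
# over toy bag-of-words vectors (real methods would use embeddings).
import math
from collections import Counter

def featurize(text: str) -> Counter:
    # Toy bag-of-words featurization (an assumption for this sketch).
    return Counter(text.lower().split())

def distance(a: Counter, b: Counter) -> float:
    # Euclidean distance between sparse bag-of-words vectors.
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def k_center_greedy(texts: list[str], k: int) -> list[int]:
    """Greedily pick k indices so the selected set covers the pool broadly."""
    feats = [featurize(t) for t in texts]
    selected = [0]  # seed with the first example
    min_dist = [distance(f, feats[0]) for f in feats]
    while len(selected) < min(k, len(texts)):
        nxt = max(range(len(texts)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, distance(f, feats[nxt])) for d, f in zip(min_dist, feats)]
    return selected

if __name__ == "__main__":
    pool = [
        "Translate to French: good morning",
        "Translate to French: good night",
        "Write a haiku about autumn leaves",
        "Solve 12 * 7 step by step",
    ]
    print(k_center_greedy(pool, k=2))  # e.g. [0, 3]: a small but diverse subset
```

The greedy step repeatedly adds the example farthest from everything already selected, which is one common way to trade a little per-example quality for broader instruction coverage.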
Authors
Jiahao Wang (Harbin Institute of Technology; Institute of Automation, Chinese Academy of Sciences)
Bolin Zhang (Harbin Institute of Technology)
Qianlong Du (Institute of Automation, Chinese Academy of Sciences)
Jiajun Zhang (Institute of Automation, Chinese Academy of Sciences; research areas: Natural Language Processing, Large Language Models, Multimodal Information Processing)
Dianhui Chu (Harbin Institute of Technology)