Importance-Aware Data Selection for Efficient LLM Instruction Tuning

📅 2025-11-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the inefficiency of data selection in instruction tuning, this paper proposes a data importance-aware filtering method. The core innovation is the Model Instruction Weakness Value (MIWV), a dynamic metric that quantifies the contribution of each instruction instance toward mitigating model capability gaps, defined as the discrepancy between the model's in-context learning (ICL) response and the ideal output. This formulation departs from conventional static quality-scoring paradigms. Experiments demonstrate that ranking instruction data by MIWV and fine-tuning on only the top 1% of instances yields superior performance across multiple benchmarks compared to full-dataset training. The method significantly improves instruction tuning efficiency while providing an interpretable, reproducible criterion for high-quality dataset curation.

📝 Abstract
Instruction tuning plays a critical role in enhancing the performance and efficiency of Large Language Models (LLMs). Its success depends not only on the quality of the instruction data but also on the inherent capabilities of the LLM itself. Some studies suggest that even a small amount of high-quality data can achieve instruction fine-tuning results that are on par with, or even exceed, those from using a full-scale dataset. However, rather than focusing solely on calculating data quality scores to evaluate instruction data, there is a growing need to select high-quality data that maximally enhances the performance of instruction tuning for a given LLM. In this paper, we propose the Model Instruction Weakness Value (MIWV) as a novel metric to quantify the importance of instruction data in enhancing the model's capabilities. The MIWV metric is derived from the discrepancies in the model's responses when using In-Context Learning (ICL), helping identify the most beneficial data for enhancing instruction tuning performance. Our experimental results demonstrate that selecting only the top 1% of data based on MIWV can outperform training on the full dataset. Furthermore, this approach extends beyond existing research that focuses on data quality scoring for data selection, offering strong empirical evidence supporting the effectiveness of our proposed method.
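The selection step described above (score each instance by MIWV, rank, keep the top 1%) can be sketched as follows. This is a minimal illustration, not the paper's released code: the function name `select_by_miwv` is hypothetical, and the scores are assumed to already encode the discrepancy between the model's ICL response and the reference output.

```python
import math

def select_by_miwv(instances, miwv_scores, fraction=0.01):
    """Keep the top `fraction` of instruction instances ranked by MIWV.

    `miwv_scores[i]` is assumed to quantify the discrepancy between the
    model's in-context-learning response to `instances[i]` and the ideal
    output; a larger value indicates a bigger capability gap, i.e. a more
    valuable instance for fine-tuning.
    """
    if len(instances) != len(miwv_scores):
        raise ValueError("instances and miwv_scores must have equal length")
    # Keep at least one instance even for tiny datasets.
    k = max(1, math.ceil(fraction * len(instances)))
    # Rank indices by MIWV, highest first, then gather the top-k instances.
    order = sorted(range(len(instances)),
                   key=miwv_scores.__getitem__, reverse=True)
    return [instances[i] for i in order[:k]]

# Toy usage: 200 instances with increasing MIWV; top 1% = 2 instances.
data = [f"inst{i}" for i in range(200)]
scores = list(range(200))
print(select_by_miwv(data, scores))  # ['inst199', 'inst198']
```

How the per-instance score itself is computed (e.g., which discrepancy measure between the ICL response and the reference answer) is specified in the paper; the sketch only shows the ranking-and-filtering stage.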
Problem

Research questions and friction points this paper is trying to address.

Proposes MIWV metric to identify instruction data that improves LLM performance
Selects high-quality data for efficient instruction tuning using model weakness analysis
Demonstrates top 1% MIWV-selected data outperforms full dataset training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Model Instruction Weakness Value metric
Uses model response discrepancies for data selection
Selects top 1% data to outperform full dataset
Tingyu Jiang
Alibaba Cloud Computing
Shen Li
Alibaba Cloud Computing
Yiyao Song
Alibaba Cloud Computing
Lan Zhang
Independent Researcher
Hualei Zhu
Alibaba Cloud Computing
Yuan Zhao
Lanzhou University of Technology
Xiaohang Xu
Postdoc at the University of Tokyo
K. Taura
Graduate School of Information Science and Technology, The University of Tokyo
Hao Henry Wang
Alibaba Cloud Computing