What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Instruction-tuning datasets often contain redundant and low-quality samples, which makes efficient data selection necessary. This work proposes an instruction data selection framework based on weighted in-context influence (wICI), which defines effective instruction data from the perspective of in-context learning: a candidate example is valuable to the extent that it reduces instruction-following difficulty for semantically related examples. The study reports a negative correlation between a sample's difficulty and its in-context influence, and connects this metric to downstream fine-tuning performance. In ablation studies and evaluations across multiple models and benchmarks, the method consistently outperforms existing baselines under limited data budgets.

📝 Abstract

Instruction-tuning datasets often contain substantial redundancy and low-quality samples, necessitating effective data selection methods. We propose an instruction data selection framework based on weighted in-context influence (wICI), which measures how effectively each candidate example reduces instruction-following difficulty for semantically related peers. Through systematic experiments, we address three key questions: what constitutes effective instruction tuning data from an in-context perspective, whether sample difficulty correlates with in-context influence, and how in-context influence translates to instruction tuning effectiveness. Experiments across multiple models and benchmarks demonstrate that our method consistently outperforms existing baselines under constrained data budgets, while empirically showing that sample difficulty negatively correlates with in-context influence.
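
To make the idea of weighted in-context influence concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the exact formula, so the similarity-weighted average, the function names `weighted_in_context_influence` and `select_top_k`, and the use of a per-peer "difficulty" number (for example, a loss value) are all assumptions made for illustration. The sketch scores a candidate by how much it lowers the difficulty of related peers when used as an in-context demonstration, then keeps the highest-scoring candidates under a fixed budget.

```python
from typing import Dict, List, Sequence


def weighted_in_context_influence(
    base_difficulty: Sequence[float],
    conditioned_difficulty: Sequence[float],
    similarity: Sequence[float],
) -> float:
    """Similarity-weighted mean drop in peer difficulty (hypothetical formulation).

    base_difficulty[i]        -- difficulty of peer i on its own (assumed: a loss-like score)
    conditioned_difficulty[i] -- difficulty of peer i with the candidate as a demonstration
    similarity[i]             -- non-negative semantic similarity between candidate and peer i
    """
    if not (len(base_difficulty) == len(conditioned_difficulty) == len(similarity)):
        raise ValueError("all inputs must have the same length")

    total_weight = sum(similarity)
    if total_weight <= 0:
        # No related peers: the candidate has no measurable influence.
        return 0.0

    weighted_drop = sum(
        w * (before - after)
        for w, before, after in zip(similarity, base_difficulty, conditioned_difficulty)
    )
    return weighted_drop / total_weight


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k candidates with the highest influence scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For example, a candidate that lowers one equally similar peer's difficulty from 2.0 to 1.0 and leaves another unchanged at 4.0 receives a score of 0.5 under this assumed formulation. A positive score means the candidate helps its peers on average, and a negative score means it makes them harder.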

Problem

Research questions and friction points this paper is trying to address.

instruction tuning

data selection

in-context learning

data quality

redundancy

Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction tuning

in-context learning

data selection