ICon: In-Context Contribution for Automatic Data Selection

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Instruction tuning often relies on expensive gradient-based computations or manually designed heuristics for data selection, hindering scalability and generalizability. Method: This paper introduces the first gradient-free, prior-knowledge-agnostic data evaluation paradigm driven by implicit in-context learning (ICL). It quantifies sample value by analyzing implicit performance shifts during ICL, establishing a three-stage contribution scoring framework to automatically identify high-value instances exhibiting both task diversity and moderate difficulty. Results: Experiments on LLaMA3.1-8B demonstrate that training on only 15% of the filtered dataset achieves a 5.42-percentage-point improvement over full-data fine-tuning and outperforms the current state-of-the-art method by 2.06 points. The approach significantly reduces computational cost and eliminates manual intervention, enabling efficient, scalable, and principled instruction tuning.
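The paper's exact three-stage scoring framework is not reproduced here, but the core idea (score a sample by the implicit performance shift it induces when used as an in-context example, with no gradient computation) can be sketched. Everything below is a hypothetical illustration: `loss_fn`, the probe set, and the 15% selection ratio are assumptions, not the authors' implementation.

```python
def icl_contribution(loss_fn, sample, probe_set):
    """Estimate a sample's contribution as the mean drop in loss on a
    small probe set when `sample` is prepended as an in-context example,
    relative to a zero-shot baseline. Gradient-free by construction."""
    baseline = sum(loss_fn(p, context=None) for p in probe_set) / len(probe_set)
    shifted = sum(loss_fn(p, context=sample) for p in probe_set) / len(probe_set)
    return baseline - shifted  # higher = larger implicit performance gain


def select_top_fraction(samples, scores, frac=0.15):
    """Keep the top `frac` of samples ranked by contribution score
    (mirroring the 15% subset used in the paper's experiments)."""
    k = max(1, int(len(samples) * frac))
    ranked = sorted(zip(scores, samples), key=lambda t: t[0], reverse=True)
    return [s for _, s in ranked[:k]]
```

In practice `loss_fn` would query an LLM for the conditional loss of a probe example with and without the candidate demonstration in the prompt; the sketch only fixes the scoring interface, not the model call.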

📝 Abstract
Data selection for instruction tuning is essential for improving the performance of Large Language Models (LLMs) and reducing training cost. However, existing automated selection methods either depend on computationally expensive gradient-based measures or on manually designed heuristics, which may fail to fully exploit the intrinsic attributes of data. In this paper, we propose In-context Learning for Contribution Measurement (ICon), a novel gradient-free method that takes advantage of the implicit fine-tuning nature of in-context learning (ICL) to measure sample contribution without gradient computation or manual indicator engineering. ICon offers a computationally efficient alternative to gradient-based methods and reduces the human inductive bias inherent in heuristic-based approaches. ICon comprises three components and identifies high-contribution data by assessing performance shifts under implicit learning through ICL. Extensive experiments on three LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of ICon. Remarkably, on LLaMA3.1-8B, models trained on 15% of ICon-selected data outperform full-dataset training by 5.42 percentage points and exceed the best of the widely used selection methods by 2.06 percentage points. We further analyze the high-contribution samples selected by ICon, which exhibit both diverse tasks and appropriate difficulty levels, rather than simply being the hardest ones.
Problem

Research questions and friction points this paper is trying to address.

Automated data selection for LLM instruction tuning
Avoiding costly gradient-based measures and hand-crafted heuristics
Measuring sample contribution via in-context learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-free data selection using in-context learning
Efficient alternative to gradient-based selection methods
Reduces human bias in heuristic-based approaches
Authors

Yixin Yang, State Key Laboratory of Multimedia Information Processing, Peking University
Qingxiu Dong, Peking University (Natural Language Processing, Machine Learning)
Linli Yao, Peking University (multi-modal semantic understanding)
Fangwei Zhu, Peking University
Zhifang Sui, State Key Laboratory of Multimedia Information Processing, Peking University