🤖 AI Summary
Existing data selection methods for task-specific fine-tuning of large language models (LLMs) suffer from low efficiency, reliance on auxiliary training, or weak heuristics. Method: This paper proposes Data Whisperer, a training-free, attention-driven data subset selection paradigm that leverages the target LLM's few-shot in-context learning, specifically its self-attention weights, to score sample importance without gradient updates or additional model adaptation. Contribution/Results: On Llama-3-8B-Instruct, the method outperforms fine-tuning on the full GSM8K dataset while using only 10% of the training data, and it surpasses the strongest existing selection method by 3.1 accuracy points with a 7.4× speedup. To the authors' knowledge, this is the first fully training-free, model-native, attention-guided data selection approach, establishing a new paradigm for efficient LLM adaptation.
📝 Abstract
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4× speedup.
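The abstract describes scoring candidate training samples via the model's own attention inside a few-shot in-context prompt. The sketch below illustrates one way such attention-based scoring could look with the Hugging Face `transformers` API: demonstrations and a query are packed into a single prompt, and each demonstration is scored by the attention mass the query tokens assign to its token span. The function name `score_examples_by_attention`, the prompt format, and the averaging over all layers and heads are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: rank candidate training examples by the attention they
# receive from a query inside a few-shot in-context prompt.
# The prompt format and layer/head averaging are assumptions, not the
# paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    output_attentions=True,
    attn_implementation="eager",  # eager attention exposes attention weights
)
model.eval()


def score_examples_by_attention(demos: list[str], query: str) -> list[float]:
    """Return one attention-based importance score per demonstration."""
    # Build the prompt and remember each demonstration's token span.
    spans, pieces, cursor = [], [], 0
    for demo in demos:
        ids = tokenizer(demo + "\n", add_special_tokens=False)["input_ids"]
        spans.append((cursor, cursor + len(ids)))
        cursor += len(ids)
        pieces.append(ids)
    query_ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor([sum(pieces, []) + query_ids])

    with torch.no_grad():
        out = model(input_ids=input_ids)

    # Average attention over layers and heads -> (seq_len, seq_len).
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0].float()

    # Attention flowing from the query tokens to each demonstration's span.
    query_rows = attn[cursor:]
    return [query_rows[:, s:e].sum().item() for s, e in spans]


# Usage: keep the highest-scoring demonstrations as the selected subset.
demos = ["Q: 2+2? A: 4", "Q: 3*5? A: 15", "Q: 12-7? A: 5"]
scores = score_examples_by_attention(demos, "Q: 7+8? A:")
selected = [d for _, d in sorted(zip(scores, demos), reverse=True)[:2]]
```

In practice one would aggregate such scores over many query samples and prompt orderings before ranking the training pool; this single-prompt version is only meant to show where the attention weights come from and how a per-example score can be read off them without any gradient updates.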