🤖 AI Summary
Existing data selection methods for task-specific fine-tuning of large language models (LLMs) suffer from low efficiency, reliance on auxiliary training, or weak heuristics. Method: This paper proposes Data Whisperer, a training-free, attention-driven data subset selection paradigm that leverages the target LLM's few-shot in-context learning, specifically its self-attention weights, to score sample importance without gradient updates or additional model adaptation. Contribution/Results: On Llama-3-8B-Instruct, the method outperforms fine-tuning on the full GSM8K dataset while using only 10% of the training data, and it surpasses the strongest existing selection method by 3.1 accuracy points with a 7.4× speedup. To the authors' knowledge, this is the first fully training-free, model-native, attention-guided data selection approach, establishing a new paradigm for efficient LLM adaptation.
📝 Abstract
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4× speedup.
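The abstract describes scoring candidate training samples via the model's own attention inside a few-shot in-context prompt. The sketch below illustrates one way such attention-based scoring could look with the Hugging Face `transformers` API: demonstrations and a query are packed into a single prompt, and each demonstration is scored by the attention mass the query tokens assign to its token span. The function name `score_examples_by_attention`, the prompt format, and the averaging over all layers and heads are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: rank candidate training examples by the attention they
# receive from a query inside a few-shot in-context prompt.
# The prompt format and layer/head averaging are assumptions, not the
# paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    output_attentions=True,
    attn_implementation="eager",  # eager attention exposes attention weights
)
model.eval()


def score_examples_by_attention(demos: list[str], query: str) -> list[float]:
    """Return one attention-based importance score per demonstration."""
    # Build the prompt and remember each demonstration's token span.
    spans, pieces, cursor = [], [], 0
    for demo in demos:
        ids = tokenizer(demo + "\n", add_special_tokens=False)["input_ids"]
        spans.append((cursor, cursor + len(ids)))
        cursor += len(ids)
        pieces.append(ids)
    query_ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor([sum(pieces, []) + query_ids])

    with torch.no_grad():
        out = model(input_ids=input_ids)

    # Average attention over layers and heads -> (seq_len, seq_len).
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0].float()

    # Attention flowing from the query tokens to each demonstration's span.
    query_rows = attn[cursor:]
    return [query_rows[:, s:e].sum().item() for s, e in spans]


# Usage: keep the highest-scoring demonstrations as the selected subset.
demos = ["Q: 2+2? A: 4", "Q: 3*5? A: 15", "Q: 12-7? A: 5"]
scores = score_examples_by_attention(demos, "Q: 7+8? A:")
selected = [d for _, d in sorted(zip(scores, demos), reverse=True)[:2]]
```

In practice one would aggregate such scores over many query samples and prompt orderings before ranking the training pool; this single-prompt version is only meant to show where the attention weights come from and how a per-example score can be read off them without any gradient updates.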