Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the prevalence of noise and redundancy in large-scale instruction tuning datasets, as well as the limitation of existing data selection methods that overlook the dynamic evolution of model uncertainty during training. To this end, we propose GRADFILTERING, a novel framework that, for the first time, integrates an uncertainty-aware mechanism into gradient analysis. By leveraging a small GPT-2 proxy model equipped with LoRA ensembles, our method dynamically computes a gradient signal-to-noise ratio (G-SNR) for each sample as a data utility score, enabling efficient and target-agnostic data filtering. Notably, GRADFILTERING avoids reliance on static proxies or costly gradient storage, and achieves performance on par with or superior to strong baselines under both LLM-as-a-judge and human evaluations. The selected subsets consistently yield faster convergence under identical computational budgets, substantially enhancing fine-tuning efficiency.

📝 Abstract
Instruction tuning is a standard paradigm for adapting large language models (LLMs), but modern instruction datasets are large, noisy, and redundant, making full-data fine-tuning costly and often unnecessary. Existing data selection methods either build expensive gradient datastores or assign static scores from a weak proxy, largely ignoring evolving uncertainty, and thus missing a key source of LLM interpretability. We propose GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that utilizes a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. Our method matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations as well as in human assessment. Moreover, GRADFILTERING-selected subsets converge faster than competitive filters under the same compute budget, reflecting the benefit of uncertainty-aware scoring.
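The abstract describes scoring each example by a Gradient Signal-to-Noise Ratio computed from a LoRA ensemble on a small proxy model, then keeping the highest-utility subset. The page does not give the exact formula, so the sketch below assumes a standard SNR form: the energy of the mean per-example gradient across ensemble members ("signal") divided by the across-member gradient variance ("noise"). The function names and the top-fraction selection helper are illustrative, not from the paper.

```python
import numpy as np

def g_snr(per_member_grads):
    """Gradient signal-to-noise ratio for one training example.

    per_member_grads: array of shape (K, D) -- the example's gradients
    under K LoRA ensemble members, flattened to D parameters.
    Assumption: G-SNR = ||mean gradient||^2 / mean squared deviation
    across members (the exact definition is not given on this page).
    """
    g = np.asarray(per_member_grads, dtype=np.float64)
    mean_g = g.mean(axis=0)                         # "signal": agreement across members
    noise = ((g - mean_g) ** 2).sum(axis=1).mean()  # "noise": disagreement (uncertainty)
    signal = (mean_g ** 2).sum()
    return signal / (noise + 1e-12)                 # epsilon guards the zero-noise case

def select_top_fraction(scores, frac=0.1):
    """Indices of the highest-utility examples (illustrative selection step)."""
    k = max(1, int(len(scores) * frac))
    return np.argsort(scores)[::-1][:k]
```

Under this reading, examples whose gradients agree across ensemble members (low uncertainty, consistent learning signal) score high, while examples whose gradients disagree score low, which is one way an uncertainty-aware utility could rank data for filtering.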
Problem

Research questions and friction points this paper is trying to address.

instruction tuning
data selection
uncertainty
gradient signal-to-noise ratio
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty-aware
gradient signal-to-noise ratio
data selection
instruction tuning
LoRA ensemble