LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of iterative data selection in LLM instruction tuning, which stems from repeated full-dataset model inference, this paper proposes an efficient online data selection framework that requires no additional forward or backward passes. Methodologically, it introduces (1) an instance-level dynamic uncertainty (IDU) utility function that jointly incorporates the instantaneous loss, a gradient-based approximation of loss change, and exponentially smoothed historical loss signals; and (2) a two-stage adaptive sampling mechanism: coarse-grained cluster selection guided by a multi-armed bandit, followed by fine-grained IDU-based filtering. Evaluated on four standard benchmarks, the method improves average performance by 6.1%-10.8% while using only 2.5% of the training data and accelerating training by 5-10x, substantially easing the efficiency-effectiveness trade-off in instruction-tuning data selection.
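The IDU idea, scoring each sample from signals already produced inside the training loop, can be sketched as follows. The class name, the smoothing factor `alpha`, and the way the three terms are combined are illustrative assumptions; the paper's exact formulation may weight or combine them differently.

```python
class IDUTracker:
    """Illustrative sketch of an instance-level dynamic uncertainty (IDU)
    score. It combines the instantaneous loss, a first-order proxy for the
    gradient-driven loss change, and an exponentially smoothed history of
    past losses. All signals come from the normal training pass, so no
    extra forward or backward passes are needed."""

    def __init__(self, alpha=0.9):
        self.alpha = alpha      # EMA smoothing factor (assumed hyperparameter)
        self.ema_loss = {}      # sample id -> exponentially smoothed loss
        self.prev_loss = {}     # sample id -> last observed loss

    def update(self, sample_id, loss):
        """Record the loss observed for this sample during training and
        return its IDU score; higher means more uncertain/informative."""
        prev = self.prev_loss.get(sample_id, loss)
        delta = loss - prev     # proxy for the loss change since the last epoch
        ema = self.ema_loss.get(sample_id, loss)
        ema = self.alpha * ema + (1 - self.alpha) * loss
        self.ema_loss[sample_id] = ema
        self.prev_loss[sample_id] = loss
        return loss + abs(delta) + ema


tracker = IDUTracker(alpha=0.9)
print(tracker.update("s1", 2.0))  # first observation, so delta is 0 -> 4.0
print(tracker.update("s1", 1.5))
```

Because the tracker only stores two scalars per sample, it scales to large datasets without re-running inference over the full pool.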

📝 Abstract
Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as they rely on repeatedly performing full-dataset model inference to estimate sample utility for subsequent training iterations, creating a fundamental efficiency bottleneck. In this paper, we propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop, eliminating the need for costly additional model inference. At its core, LEAD introduces Instance-Level Dynamic Uncertainty (IDU), a theoretically grounded utility function combining instantaneous training loss, gradient-based approximation of loss changes, and exponential smoothing of historical loss signals. To further scale efficiently to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy, adaptively prioritizing informative clusters through a multi-armed bandit mechanism, followed by precise fine-grained selection of high-utility samples using IDU. Extensive experiments across four diverse benchmarks show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10x.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead in iterative data selection for LLM tuning
Eliminating costly model inference during sample utility estimation
Improving efficiency and performance in large-scale dataset training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instance-Level Dynamic Uncertainty for utility estimation
Two-stage coarse-to-fine data selection strategy
Eliminates costly additional model inference overhead
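The two-stage coarse-to-fine strategy can be sketched with a standard UCB bandit over clusters followed by IDU-based filtering inside the winning cluster. The function name, the UCB exploration constant `c`, and the reward bookkeeping below are assumptions for illustration, not the paper's API.

```python
import math

def select_batch(clusters, idu_scores, pulls, rewards, t, batch_size, c=1.0):
    """Two-stage coarse-to-fine selection sketch.
    Stage 1: a UCB-style multi-armed bandit picks the cluster whose samples
    have recently yielded the most utility (unpulled clusters are tried first).
    Stage 2: within that cluster, keep the samples with the highest IDU scores.

    clusters:   cluster id -> list of sample ids
    idu_scores: sample id -> current IDU score
    pulls:      cluster id -> times the cluster was selected so far
    rewards:    cluster id -> cumulative utility observed for the cluster
    t:          current selection round (for the UCB confidence term)
    """
    def ucb(k):
        if pulls[k] == 0:
            return float("inf")   # force at least one pull per cluster
        return rewards[k] / pulls[k] + c * math.sqrt(math.log(t) / pulls[k])

    best = max(clusters, key=ucb)
    # Fine-grained stage: rank the chosen cluster's samples by IDU.
    chosen = sorted(clusters[best], key=lambda s: idu_scores[s], reverse=True)
    return best, chosen[:batch_size]


clusters = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"]}
idu = {"a1": 0.5, "a2": 0.9, "a3": 0.1, "b1": 0.3, "b2": 0.7}
pulls = {"a": 3, "b": 1}
rewards = {"a": 2.0, "b": 0.9}
print(select_batch(clusters, idu, pulls, rewards, t=4, batch_size=1))
# -> ('b', ['b2']): cluster "b" has the higher UCB score, and b2 the higher IDU
```

The bandit stage keeps per-round cost proportional to the number of clusters rather than the full dataset, which is what makes the iterative selection cheap.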