🤖 AI Summary
To address low data utilization efficiency in instruction tuning for large language models (LLMs), this paper proposes D3, a data selection framework that jointly accounts for three aspects of data value: diversity, difficulty, and dependability. Methodologically: (1) a diversity function measures sample distinctiveness; (2) generation difficulty is estimated via uncertainty-based prediction, explicitly mitigating the interference of context-oriented generation diversity; (3) an external LLM assesses the dependability of candidate instances. A weighted coreset objective then jointly optimizes all three aspects to solve for the most valuable subset, and the scoring and selection steps can iterate over multiple rounds, using feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate that D3 achieves competitive or even superior instruction-following performance relative to full-data fine-tuning while using less than 10% of the training data, improving both effectiveness and sample efficiency.
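As a concrete illustration of the uncertainty-based difficulty idea, the sketch below scores a response by the average per-token predictive entropy of the model's next-token distributions. This is a minimal, hypothetical instantiation, not the paper's exact formulation: the function names and the choice of mean entropy (rather than a sum, which would conflate difficulty with response length and context-driven diversity) are assumptions for illustration.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-token Shannon entropy (nats) of next-token distributions.

    probs: (seq_len, vocab) array of predicted probabilities at each
    generated token position.
    """
    eps = 1e-12  # guard against log(0)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def difficulty_score(probs: np.ndarray) -> float:
    """Mean per-token uncertainty as a proxy for generation difficulty.

    Averaging over positions keeps the score length-invariant, one simple
    way to reduce the influence of context-oriented generation diversity
    on the difficulty estimate (an assumption of this sketch).
    """
    return float(token_entropy(probs).mean())

# Hypothetical distributions: a confident response vs. an uncertain one.
vocab = 50
confident = np.full((8, vocab), 0.002)
confident[:, 0] = 1.0 - 0.002 * (vocab - 1)   # mass concentrated on one token
uncertain = np.full((8, vocab), 1.0 / vocab)  # near-uniform predictions

# A harder sample elicits more uncertain predictions, hence a higher score.
assert difficulty_score(uncertain) > difficulty_score(confident)
```

In practice the probability arrays would come from the LLM being tuned (e.g. softmaxed logits over the response tokens), which this sketch stubs out with synthetic distributions.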
📝 Abstract
Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method, which comprises two key steps: scoring and selection. Specifically, in the scoring step, we define a diversity function to measure sample distinctiveness and introduce uncertainty-based prediction difficulty to evaluate sample difficulty while mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes the three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate over multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
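To make the selection step concrete, here is a minimal sketch of a value-weighted greedy k-center heuristic, a common way to approximate a weighted coreset objective. The paper's actual D3 objective is not specified here, so treat this as an assumed stand-in: embeddings supply the diversity signal, and a per-sample weight (e.g. the product of difficulty and dependability scores) biases coverage toward valuable samples. All names and the weighting scheme are illustrative.

```python
import numpy as np

def weighted_kcenter_select(embeds: np.ndarray, weights: np.ndarray, k: int) -> list[int]:
    """Greedy value-weighted k-center selection (a coreset heuristic).

    embeds:  (n, d) sample embeddings; diversity = distance in this space.
    weights: (n,) per-sample value, e.g. difficulty * dependability.
    k:       target subset size.

    Each step picks the sample maximizing weight * distance to the
    already-selected set, trading off sample value against redundancy.
    """
    selected = [int(np.argmax(weights))]  # seed with the highest-value sample
    dists = np.linalg.norm(embeds - embeds[selected[0]], axis=1)
    while len(selected) < k:
        gains = weights * dists           # value-weighted coverage gain
        gains[selected] = -np.inf         # never re-pick a selected sample
        nxt = int(np.argmax(gains))
        selected.append(nxt)
        # Distance to the selected set shrinks as new centers are added.
        dists = np.minimum(dists, np.linalg.norm(embeds - embeds[nxt], axis=1))
    return selected

# Toy usage: two tight clusters; weighted selection should cover both.
rng = np.random.default_rng(0)
embeds = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(3, 0.1, (10, 4))])
weights = rng.uniform(0.5, 1.0, 20)
subset = weighted_kcenter_select(embeds, weights, k=4)
```

The multi-round feedback loop described in the abstract could be layered on top by rescaling `weights` between rounds based on the tuned model's updated uncertainty estimates, though how D3 implements that feedback is not detailed here.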