🤖 AI Summary
To address low data utilization efficiency in instruction tuning for large language models (LLMs), this paper proposes D3, a data selection framework that jointly accounts for three aspects of data value: diversity, difficulty, and dependability. Methodologically: (1) a diversity function measures sample distinctiveness; (2) generation difficulty is estimated via uncertainty-based prediction, explicitly mitigating the interference of context-oriented generation diversity; (3) an external LLM assesses the dependability of candidate instances. A weighted coreset objective then jointly optimizes all three aspects to solve for the most valuable subset, and the scoring and selection steps can iterate over multiple rounds, using feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate that D3 achieves competitive or even superior instruction-following performance relative to full-data fine-tuning while using less than 10% of the training data, improving both effectiveness and sample efficiency.
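As a concrete illustration of the uncertainty-based difficulty idea, the sketch below scores a response by the average per-token predictive entropy of the model's next-token distributions. This is a minimal, hypothetical instantiation, not the paper's exact formulation: the function names and the choice of mean entropy (rather than a sum, which would conflate difficulty with response length and context-driven diversity) are assumptions for illustration.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-token Shannon entropy (nats) of next-token distributions.

    probs: (seq_len, vocab) array of predicted probabilities at each
    generated token position.
    """
    eps = 1e-12  # guard against log(0)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def difficulty_score(probs: np.ndarray) -> float:
    """Mean per-token uncertainty as a proxy for generation difficulty.

    Averaging over positions keeps the score length-invariant, one simple
    way to reduce the influence of context-oriented generation diversity
    on the difficulty estimate (an assumption of this sketch).
    """
    return float(token_entropy(probs).mean())

# Hypothetical distributions: a confident response vs. an uncertain one.
vocab = 50
confident = np.full((8, vocab), 0.002)
confident[:, 0] = 1.0 - 0.002 * (vocab - 1)   # mass concentrated on one token
uncertain = np.full((8, vocab), 1.0 / vocab)  # near-uniform predictions

# A harder sample elicits more uncertain predictions, hence a higher score.
assert difficulty_score(uncertain) > difficulty_score(confident)
```

In practice the probability arrays would come from the LLM being tuned (e.g. softmaxed logits over the response tokens), which this sketch stubs out with synthetic distributions.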
📝 Abstract
Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method, which comprises two key steps: scoring and selection. Specifically, in the scoring step, we define a diversity function to measure sample distinctiveness and introduce uncertainty-based prediction difficulty to evaluate sample difficulty while mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes the three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate over multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
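To make the selection step concrete, here is a minimal sketch of a value-weighted greedy k-center heuristic, a common way to approximate a weighted coreset objective. The paper's actual D3 objective is not specified here, so treat this as an assumed stand-in: embeddings supply the diversity signal, and a per-sample weight (e.g. the product of difficulty and dependability scores) biases coverage toward valuable samples. All names and the weighting scheme are illustrative.

```python
import numpy as np

def weighted_kcenter_select(embeds: np.ndarray, weights: np.ndarray, k: int) -> list[int]:
    """Greedy value-weighted k-center selection (a coreset heuristic).

    embeds:  (n, d) sample embeddings; diversity = distance in this space.
    weights: (n,) per-sample value, e.g. difficulty * dependability.
    k:       target subset size.

    Each step picks the sample maximizing weight * distance to the
    already-selected set, trading off sample value against redundancy.
    """
    selected = [int(np.argmax(weights))]  # seed with the highest-value sample
    dists = np.linalg.norm(embeds - embeds[selected[0]], axis=1)
    while len(selected) < k:
        gains = weights * dists           # value-weighted coverage gain
        gains[selected] = -np.inf         # never re-pick a selected sample
        nxt = int(np.argmax(gains))
        selected.append(nxt)
        # Distance to the selected set shrinks as new centers are added.
        dists = np.minimum(dists, np.linalg.norm(embeds - embeds[nxt], axis=1))
    return selected

# Toy usage: two tight clusters; weighted selection should cover both.
rng = np.random.default_rng(0)
embeds = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(3, 0.1, (10, 4))])
weights = rng.uniform(0.5, 1.0, 20)
subset = weighted_kcenter_select(embeds, weights, k=4)
```

The multi-round feedback loop described in the abstract could be layered on top by rescaling `weights` between rounds based on the tuned model's updated uncertainty estimates, though how D3 implements that feedback is not detailed here.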