D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning

📅 2025-03-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low data utilization efficiency in large language model (LLM) instruction tuning, this paper proposes a data selection framework that jointly considers three aspects of data value: diversity, debiased difficulty, and dependability. Methodologically: (1) generation difficulty is modeled via uncertainty estimation, explicitly decoupling it from context-oriented generation diversity; (2) an external LLM provides dependability scores for candidate instances; (3) a weighted coreset objective jointly optimizes the three aspects, with multi-round feedback enabling adaptive subset selection. The key contribution lies in the unified modeling of all three dimensions of data value within a single optimization objective. Experiments on three datasets demonstrate that the method achieves competitive or superior performance relative to full-data fine-tuning using less than 10% of the training data, improving both instruction-following capability and sample efficiency.

📝 Abstract
Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on three datasets demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
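The two-step procedure described above can be sketched as a greedy selection loop. This is an illustrative reconstruction, not the paper's implementation: the diversity term is approximated here by distance to the nearest already-selected sample, and `difficulty` and `dependability` are assumed to be precomputed per-sample scores (e.g., from uncertainty estimation and an external LLM, respectively). The function name and the additive weighting with `alpha`/`beta` are assumptions for the sketch.

```python
import numpy as np

def d3_select(embeddings, difficulty, dependability, k, alpha=1.0, beta=1.0):
    """Greedy sketch of a weighted-coreset-style selection that combines
    diversity (distance to the already-selected set), difficulty, and
    dependability. Weightings are illustrative, not the paper's objective."""
    n = embeddings.shape[0]
    selected = []
    # Distance of each candidate to its nearest selected sample (diversity term).
    min_dist = np.full(n, np.inf)
    for _ in range(k):
        # Before anything is selected, treat all samples as equally diverse.
        diversity = np.where(np.isinf(min_dist), 1.0, min_dist)
        score = diversity + alpha * difficulty + beta * dependability
        score[selected] = -np.inf  # never reselect a sample
        i = int(np.argmax(score))
        selected.append(i)
        # Update nearest-selected distances with the newly chosen sample.
        d = np.linalg.norm(embeddings - embeddings[i], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected
```

In the paper's iterative variant, the difficulty scores would be recomputed with the partially tuned model between rounds, so the selection focus shifts with feedback; the sketch above covers a single round.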
Problem

Research questions and friction points this paper is trying to address.

Automatically identify valuable subsets from large datasets
Enhance effectiveness and efficiency of instruction tuning
Optimize data selection based on diversity, difficulty, dependability
Innovation

Methods, ideas, or system contributions that make the work stand out.

D3 method scores data by diversity, difficulty, dependability
D3 weighted coreset objective jointly optimizes data value for subset selection
Iterative feedback refines selection focus adaptively
Jia Zhang
National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Algorithm Tech, Taobao & Tmall Group of Alibaba
Chen-Xi Zhang
National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Algorithm Tech, Taobao & Tmall Group of Alibaba
Yao Liu
Algorithm Tech, Taobao & Tmall Group of Alibaba
Yi-Xuan Jin
Nanjing University
Xiaowen Yang
National Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University
Bo Zheng
Algorithm Tech, Taobao & Tmall Group of Alibaba
Yi Liu
Algorithm Tech, Taobao & Tmall Group of Alibaba
Lan-Zhe Guo
LAMDA Group, Nanjing University