🤖 AI Summary
Existing LLM instruction-tuning data selection methods suffer from two key limitations: (1) sample-level evaluation ignores token-level informativeness, and (2) scoring is brittle—easily biased by superficial lexical features. This work proposes a fine-grained, robust hierarchical data selection framework. First, it introduces *token-selective scoring*, a novel mechanism that uses an LLM to assess the information contribution of each token, replacing coarse-grained sample-level scoring. Second, it imposes a *neighborhood robustness constraint* to enforce local consistency in sample quality and mitigate lexical bias. Third, it employs a lightweight surrogate model (e.g., GPT-2) for efficient, scalable scoring. Experiments show that using only 5% of the training data, the method outperforms full-data training by up to 5.48 points on average across eight benchmarks. Moreover, it processes 52k samples in just 40 minutes on a single GPU, significantly improving both efficiency and generalization.
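To make the token-selective idea concrete: IFD-style scoring compares how well a model predicts each response token with versus without the instruction. The paper's exact formulation isn't spelled out here, so the following is a minimal sketch under assumed details—the per-token ratio `exp(logp_uncond - logp_cond)` follows the standard IFD perplexity ratio, and the `keep_fraction` top-k filtering is a hypothetical stand-in for the paper's token selection rule. In practice the log-probabilities would come from a surrogate LM such as GPT-2.

```python
import math

def ifd_token_scores(logp_cond, logp_uncond):
    """Per-token IFD-style ratios: how much harder each response token is
    to predict without the instruction than with it.

    logp_cond[i]   = log p(token_i | instruction, previous response tokens)
    logp_uncond[i] = log p(token_i | previous response tokens only)
    """
    # exp(logp_uncond - logp_cond) = PPL_with_instruction / PPL_without,
    # computed token by token rather than averaged over the whole sample.
    return [math.exp(lu - lc) for lc, lu in zip(logp_cond, logp_uncond)]

def token_selective_score(logp_cond, logp_uncond, keep_fraction=0.5):
    """Score a sample using only its most informative tokens (here: the
    top `keep_fraction` by IFD ratio) instead of averaging over all of
    them. `keep_fraction` is a hypothetical selection rule."""
    ratios = ifd_token_scores(logp_cond, logp_uncond)
    k = max(1, int(len(ratios) * keep_fraction))
    top = sorted(ratios, reverse=True)[:k]
    return sum(top) / k
```

Averaging over only the informative tokens keeps boilerplate tokens (punctuation, formatting, stock phrases) from diluting or inflating a sample's quality score.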
📝 Abstract
Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high-quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promotes robust, reliable samples whose neighbors also exhibit high quality with fewer local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU.
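One way to read the neighborhood robustness constraint (a sketch under assumed details, not the paper's exact formulation): blend each sample's own quality score with the mean score of its nearest neighbors in an embedding space, so a sample that scores high only because of superficial lexical quirks—while sitting among low-quality neighbors—gets down-weighted. The blending weight `alpha`, the neighbor count `k`, and the cosine-similarity neighborhood are all illustrative assumptions.

```python
import numpy as np

def robust_scores(embeddings, scores, k=3, alpha=0.5):
    """Blend each sample's own score with the mean score of its k nearest
    neighbors (cosine similarity). `alpha` (hypothetical) trades off the
    sample's own score against neighborhood agreement."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)          # exclude each sample itself
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # indices of k nearest neighbors
    neigh_mean = scores[nbrs].mean(axis=1)  # neighborhood quality
    return alpha * scores + (1 - alpha) * neigh_mean

def select_top(embeddings, scores, budget=0.05, k=3, alpha=0.5):
    """Keep the top `budget` fraction of samples (e.g. 5%) ranked by
    their neighborhood-robust score."""
    r = robust_scores(embeddings, scores, k=k, alpha=alpha)
    n_keep = max(1, int(len(scores) * budget))
    return np.argsort(-r)[:n_keep]
```

With this scheme, a sample whose raw score is the highest in the pool can still be ranked below samples in consistently high-quality neighborhoods, which is the intended robustness effect.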