🤖 AI Summary
This work addresses the pervasive issue of weak specification in instruction tuning, where, for a given input, the same output can remain plausible under several alternative instructions, making it difficult to assess whether an instruction genuinely guides the target response. To tackle this, the paper formally quantifies how necessary an instruction is to the supervision signal by introducing the Task-Specificity Score (TSS) and its enhanced variant, TSS++. These metrics measure the degree to which an instruction determines the model's prediction by comparing outputs generated under the original instruction versus plausible alternative instructions. The approach combines hard negative sampling with a lightweight quality term to build an efficient contrastive scoring framework. Experiments across three datasets (Alpaca, Dolly-15k, NI-20) and three open models (Gemma, Llama, Qwen) demonstrate that selecting high-task-specificity samples via TSS significantly improves downstream performance under limited token budgets, offering a data-curation dimension that complements conventional quality filtering.
📝 Abstract
Instruction tuning is now the default way to train and adapt large language models, but many instruction--input--output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: \emph{does the instruction uniquely determine the target output?} We propose the \textbf{Task-Specificity Score (TSS)} to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce \textbf{TSS++}, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (\textsc{Alpaca}, \textsc{Dolly-15k}, \textsc{NI-20}) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.
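The abstract describes TSS as contrasting the true instruction against plausible alternatives for the same input, but does not spell out a formula. The sketch below is one plausible reading, not the paper's actual method: it scores an example by the gap between the output's log-likelihood under the true instruction and under the strongest alternative. The helper `toy_logprob` and its made-up scores stand in for `log P(output | instruction, input)` from a real LLM; all names and numbers here are illustrative assumptions.

```python
# Hypothetical sketch of the contrastive idea behind TSS (not the paper's code).
# toy_logprob stands in for log P(output | instruction, input) from a real model;
# the dictionary values are made-up numbers for illustration only.

TOY_SCORES = {
    ("Translate to French", "hello", "bonjour"): -0.5,
    ("Repeat the input", "hello", "bonjour"): -8.0,
    ("Summarize the input", "hello", "bonjour"): -7.5,
}

def toy_logprob(instruction, inp, output):
    """Stub for a model's conditional log-likelihood of the output."""
    return TOY_SCORES.get((instruction, inp, output), -10.0)

def task_specificity_score(instruction, alternatives, inp, output,
                           logprob=toy_logprob):
    """Gap between the output's log-likelihood under the true instruction
    and under the best-scoring plausible alternative instruction."""
    true_lp = logprob(instruction, inp, output)
    best_alt_lp = max(logprob(a, inp, output) for a in alternatives)
    return true_lp - best_alt_lp

score = task_specificity_score(
    "Translate to French",
    ["Repeat the input", "Summarize the input"],
    "hello",
    "bonjour",
)
print(round(score, 2))  # a large positive gap: the instruction strongly
                        # determines this output relative to the alternatives
```

Under this reading, a near-zero or negative score flags a weakly specified example (the output is just as plausible without the true instruction), which is the kind of example TSS-based selection would deprioritize; TSS++'s hard alternatives would make `best_alt_lp` a tighter baseline.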