🤖 AI Summary
This work addresses the pervasive issue of weak specification in instruction tuning, where, for a given input, the same output can remain plausible under several alternative instructions, making it difficult to assess whether an instruction genuinely guides the target response. To tackle this, the paper formally quantifies how necessary an instruction is to the supervision signal by introducing the Task-Specificity Score (TSS) and its enhanced variant, TSS++. These metrics measure the degree to which an instruction determines the model's prediction by comparing outputs generated under the original instruction versus plausible alternative instructions. The approach combines hard negative sampling with a lightweight quality term to build an efficient contrastive scoring framework. Experiments across three datasets (Alpaca, Dolly-15k, NI-20) and three open models (Gemma, Llama, Qwen) demonstrate that selecting high-task-specificity samples via TSS significantly improves downstream performance under limited token budgets, offering a data-curation dimension that complements conventional quality filtering.
📝 Abstract
Instruction tuning is now the default way to train and adapt large language models, but many instruction--input--output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: \emph{does the instruction uniquely determine the target output?} We propose the \textbf{Task-Specificity Score (TSS)} to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce \textbf{TSS++}, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (\textsc{Alpaca}, \textsc{Dolly-15k}, \textsc{NI-20}) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.
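The abstract describes TSS as contrasting the true instruction against plausible alternatives for the same input, but does not spell out a formula. The sketch below is one plausible reading, not the paper's actual method: it scores an example by the gap between the output's log-likelihood under the true instruction and under the strongest alternative. The helper `toy_logprob` and its made-up scores stand in for `log P(output | instruction, input)` from a real LLM; all names and numbers here are illustrative assumptions.

```python
# Hypothetical sketch of the contrastive idea behind TSS (not the paper's code).
# toy_logprob stands in for log P(output | instruction, input) from a real model;
# the dictionary values are made-up numbers for illustration only.

TOY_SCORES = {
    ("Translate to French", "hello", "bonjour"): -0.5,
    ("Repeat the input", "hello", "bonjour"): -8.0,
    ("Summarize the input", "hello", "bonjour"): -7.5,
}

def toy_logprob(instruction, inp, output):
    """Stub for a model's conditional log-likelihood of the output."""
    return TOY_SCORES.get((instruction, inp, output), -10.0)

def task_specificity_score(instruction, alternatives, inp, output,
                           logprob=toy_logprob):
    """Gap between the output's log-likelihood under the true instruction
    and under the best-scoring plausible alternative instruction."""
    true_lp = logprob(instruction, inp, output)
    best_alt_lp = max(logprob(a, inp, output) for a in alternatives)
    return true_lp - best_alt_lp

score = task_specificity_score(
    "Translate to French",
    ["Repeat the input", "Summarize the input"],
    "hello",
    "bonjour",
)
print(round(score, 2))  # a large positive gap: the instruction strongly
                        # determines this output relative to the alternatives
```

Under this reading, a near-zero or negative score flags a weakly specified example (the output is just as plausible without the true instruction), which is the kind of example TSS-based selection would deprioritize; TSS++'s hard alternatives would make `best_alt_lp` a tighter baseline.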