Task-Specificity Score: Measuring How Much Instructions Really Matter for Supervision

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the pervasive issue of weak specification in instruction tuning, where a single input can yield plausible outputs under different instructions, making it difficult to assess whether an instruction genuinely guides the desired response. To tackle this, the paper formally quantifies the necessity of instructions in supervision signals by introducing the Task-Specificity Score (TSS) and its enhanced variant, TSS++. These metrics measure the degree to which an instruction determines the model’s prediction by comparing outputs generated under the original instruction versus plausible alternative instructions. The approach integrates hard negative sampling with lightweight quality evaluation to construct an efficient contrastive scoring framework. Experiments across datasets such as Alpaca, Dolly-15k, and NI-20—and models including Gemma, Llama, and Qwen—demonstrate that selecting high-task-specificity samples via TSS significantly improves downstream performance under limited token budgets, offering a novel dimension for data curation beyond conventional quality filtering.

📝 Abstract
Instruction tuning is now the default way to train and adapt large language models, but many instruction-input-output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: *does the instruction uniquely determine the target output?* We propose the **Task-Specificity Score (TSS)** to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce **TSS++**, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (Alpaca, Dolly-15k, NI-20) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.
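The contrastive idea behind TSS can be sketched as follows. This is a minimal illustration, not the paper's published formula: it assumes the score is the log-likelihood margin of the target output under the true instruction versus the hardest plausible alternative (the "hard negative" intuition behind TSS++). The function name and the toy log-likelihood values are hypothetical; in practice the log-likelihoods would come from a scoring LM.

```python
def task_specificity_score(log_p_true, log_p_alternatives):
    """Sketch of a TSS-like contrastive score (assumed form).

    log_p_true: log P(output | true instruction, input)
    log_p_alternatives: log P(output | alt instruction, input) for each
        plausible alternative instruction for the same input.

    A large positive score means the true instruction is much more
    predictive of the output than any alternative, i.e. the example is
    highly task-specific; a score near zero means the supervision is
    weakly specified.
    """
    if not log_p_alternatives:
        raise ValueError("need at least one alternative instruction")
    # Hard-negative intuition: contrast against the most competitive
    # alternative (max), not the average, so easy negatives cannot
    # inflate the score.
    hardest = max(log_p_alternatives)
    return log_p_true - hardest

# Toy numbers standing in for a scoring LM's log-likelihoods of the same
# output under the original vs. three alternative instructions.
score = task_specificity_score(-12.0, [-15.5, -13.2, -20.1])
```

Under this sketch, data curation would rank a corpus by this score and keep the top examples within the token budget, complementing quality filters such as perplexity or IFD.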
Problem

Research questions and friction points this paper is trying to address.

instruction tuning
task specificity
weak supervision
instruction-output alignment
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-Specificity Score
instruction tuning
hard negatives
example selection
large language models
Pritam Kadasi
Lingo Research Group, Indian Institute of Technology Gandhinagar, India
Abhishek Upperwal
Soket AI, India
Mayank Singh
Assistant Professor, Computer Science and Engineering, IIT Gandhinagar
LLMs · Interpretability · Explainability · Code-mixing · NLP/ML/AI