🤖 AI Summary
This work identifies severe instability in instruction-tuned large language models (IT-LLMs) when they execute simple, self-contained atomic instructions, in particular a sensitivity to option-label format (e.g., non-numeric labels cause substantial performance degradation) that exposes a fundamental deficiency in basic instruction-following capability. Method: The authors construct modified MMLU and MMLU-Pro benchmarks and design four experimental paradigms to systematically evaluate the robustness of 20 state-of-the-art IT-LLMs with respect to label formatting, instruction omission, and few-shot prompting. Contribution/Results: Minor label-format perturbations reduce accuracy by up to 30.45%; performance deteriorates further when explicit instructions are omitted; and few-shot examples fail to mitigate this fragility. This is the first systematic diagnosis of atomic instruction-following vulnerability in IT-LLMs, providing critical empirical evidence and a novel evaluation framework to inform mechanistic modeling and robustness enhancement of instruction tuning.
📝 Abstract
Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical, under four experimental paradigms: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric labels), revealing instruction-format bias. (2) Without instructions, performance drops further (by up to 10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fall below random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
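The label-format perturbation described above can be illustrated with a minimal sketch: the same MMLU-style question is rendered with alphabetic, numeric, or Roman-numeral option labels while option content stays fixed. This is a hypothetical reconstruction for illustration only; all function and variable names (`to_roman`, `render_prompt`, `LABELERS`) are assumptions, not the paper's actual harness.

```python
def to_roman(n: int) -> str:
    """Convert a 1-based index to a Roman numeral (sufficient for option counts)."""
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

# The three label formats varied in the experiments; content is untouched.
LABELERS = {
    "alphabetic": lambda i: chr(ord("A") + i),  # A, B, C, ...
    "numeric": lambda i: str(i + 1),            # 1, 2, 3, ...
    "roman": lambda i: to_roman(i + 1),         # I, II, III, ...
}

def render_prompt(question: str, options: list[str], fmt: str) -> str:
    """Render one multiple-choice prompt under the chosen label format."""
    label = LABELERS[fmt]
    lines = [question]
    lines += [f"{label(i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the label of the correct option.")
    return "\n".join(lines)

question = "Which planet is closest to the Sun?"
options = ["Venus", "Mercury", "Earth", "Mars"]
for fmt in LABELERS:
    print(render_prompt(question, options, fmt))
    print()
```

Comparing a model's accuracy across the three renderings of the same items is what isolates label-format sensitivity from question difficulty.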