🤖 AI Summary
This work identifies severe instability in instruction-tuned large language models (IT-LLMs) when they execute simple, self-contained atomic instructions, in particular a sensitivity to option-label format (e.g., non-numeric labels cause substantial performance degradation) that exposes a fundamental deficiency in basic instruction-following capability. Method: The authors construct modified MMLU and MMLU-Pro benchmarks and design four experimental paradigms to systematically evaluate the robustness of 20 state-of-the-art IT-LLMs with respect to label formatting, instruction omission, and few-shot prompting. Contribution/Results: Minor label-format perturbations reduce accuracy by up to 30.45%; performance deteriorates further when explicit instructions are omitted; and few-shot examples fail to mitigate this fragility. This is the first systematic diagnosis of atomic instruction-following vulnerability in IT-LLMs, providing critical empirical evidence and a novel evaluation framework to inform mechanistic modeling and robustness enhancement of instruction tuning.
📝 Abstract
Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical, under four experimental paradigms: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric labels), revealing instruction-format bias. (2) Without instructions, performance drops further (by up to 10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fall below random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
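The label-format perturbation described above can be illustrated with a minimal sketch: the same MMLU-style question is rendered with alphabetic, numeric, or Roman-numeral option labels while option content stays fixed. This is a hypothetical reconstruction for illustration only; all function and variable names (`to_roman`, `render_prompt`, `LABELERS`) are assumptions, not the paper's actual harness.

```python
def to_roman(n: int) -> str:
    """Convert a 1-based index to a Roman numeral (sufficient for option counts)."""
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

# The three label formats varied in the experiments; content is untouched.
LABELERS = {
    "alphabetic": lambda i: chr(ord("A") + i),  # A, B, C, ...
    "numeric": lambda i: str(i + 1),            # 1, 2, 3, ...
    "roman": lambda i: to_roman(i + 1),         # I, II, III, ...
}

def render_prompt(question: str, options: list[str], fmt: str) -> str:
    """Render one multiple-choice prompt under the chosen label format."""
    label = LABELERS[fmt]
    lines = [question]
    lines += [f"{label(i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the label of the correct option.")
    return "\n".join(lines)

question = "Which planet is closest to the Sun?"
options = ["Venus", "Mercury", "Earth", "Mars"]
for fmt in LABELERS:
    print(render_prompt(question, options, fmt))
    print()
```

Comparing a model's accuracy across the three renderings of the same items is what isolates label-format sensitivity from question difficulty.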