🤖 AI Summary
This work reveals and systematically quantifies a paradoxical interference in large language models: while instruction following enhances alignment with human intent, it can inadvertently degrade task-solving performance. We introduce SUSTAINSCORE, a novel evaluation metric that measures this negative impact by inserting self-evident constraints—already naturally satisfied by the model’s original output—and assessing performance degradation. Through attention analysis, extraction of self-evident constraints, and multi-paradigm post-training experiments, we demonstrate the phenomenon’s prevalence across models and tasks, including mathematical reasoning, multi-hop question answering, and code generation. Our findings show that even state-of-the-art models like Claude-Sonnet-4.5 suffer significant performance drops when such constraints are added, with failure cases exhibiting heightened attention to the constraints. Moreover, different alignment strategies exhibit varying sensitivities to this interference.
📝 Abstract
Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. However, we reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving. It measures the task performance drop after inserting into the instruction a self-evident constraint, i.e., a constraint extracted from the original successful model output and thus naturally satisfied by it. Experiments on current LLMs in mathematics, multi-hop QA, and code generation show that adding self-evident constraints leads to substantial performance drops, even for advanced models such as Claude-Sonnet-4.5. We validate the generality of the interference across constraint types and scales. Furthermore, we identify common failure patterns, and by investigating the mechanisms of interference, we observe that failed cases allocate significantly more attention to the constraints than successful ones. Finally, we use SUSTAINSCORE to conduct an initial investigation into how distinct post-training paradigms affect the interference, presenting empirical observations on current alignment strategies. We will release our code and data to facilitate further research.
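The abstract does not give SUSTAINSCORE's exact formula; a minimal sketch of the evaluation loop it describes, assuming the score is the fraction of original task performance retained after inserting a self-evident constraint (all function names here are hypothetical, not from the paper):

```python
def accuracy(outputs, is_correct):
    """Fraction of model outputs judged correct by a task-specific checker."""
    return sum(1 for o in outputs if is_correct(o)) / len(outputs)

def sustain_score(acc_plain, acc_constrained):
    """Assumed form: share of unconstrained-task accuracy retained once a
    self-evident constraint is added to the instruction. 1.0 means no
    interference; lower values mean instruction following degraded solving."""
    if acc_plain == 0:
        return 0.0
    return acc_constrained / acc_plain

# Illustrative numbers (not from the paper): the model solves 80% of tasks
# without the constraint but only 60% with it, retaining 75% of performance.
score = sustain_score(0.8, 0.6)
```

Under this reading, a large accuracy gap between the plain and constrained prompts, despite the constraint being trivially satisfiable, is the interference the paper quantifies.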