Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies “reasoning rigidity” in large language models (LLMs): a systematic failure to adhere to explicit user instructions and constraints during complex mathematical and logical reasoning, stemming from over-reliance on habitual inference paths. To diagnose the phenomenon, the authors introduce an expert-curated diagnostic benchmark built from modified AIME and MATH500 problems and well-known logic puzzles deliberately redesigned so that the familiar solution strategy no longer applies. Analyzing model behavior on this set, they identify three recurring instruction-overriding modes, Interpretation Overload, Input Distrust, and Partial Instruction Attention, each of which causes models to ignore or distort the stated constraints. The analysis traces constraint violations to entrenched reasoning habits that contaminate the model’s inference, and the dataset is publicly released to support future work on robust, instruction-compliant reasoning in LLMs.

📝 Abstract
Large language models have demonstrated remarkable proficiency in long and complex reasoning tasks. However, they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term “reasoning rigidity”. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories, leading to incorrect conclusions. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzles, where precise adherence to specified constraints is critical. To systematically investigate reasoning rigidity, a behavior largely unexplored in prior work, we introduce an expert-curated diagnostic set. Our dataset includes specially modified variants of existing mathematical benchmarks, namely AIME and MATH500, as well as well-known puzzles deliberately redesigned to require deviation from familiar reasoning strategies. Using this dataset, we identify recurring contamination patterns that occur when models default to ingrained reasoning. Specifically, we categorize this contamination into three distinctive modes: (i) Interpretation Overload, (ii) Input Distrust, and (iii) Partial Instruction Attention, each causing models to ignore or distort provided instructions. We publicly release our diagnostic set to facilitate future research on mitigating reasoning rigidity in language models.
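
To make the diagnostic idea concrete, here is a minimal Python sketch (not from the paper) of how such a set could be scored: each entry pairs a modified problem with both the answer that respects the modification and the answer to the familiar, unmodified version, so a response matching the latter signals that the model reverted to its habitual reasoning path. The schema, names (`DiagnosticItem`, `diagnose`), and toy item are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DiagnosticItem:
    """One entry in a rigidity diagnostic set (hypothetical schema)."""
    prompt: str           # problem with an explicit, unusual constraint
    correct_answer: str   # answer that respects the modified constraint
    habitual_answer: str  # answer to the familiar, unmodified version

def diagnose(item: DiagnosticItem, model_answer: str) -> str:
    """Classify a single model response against one diagnostic item."""
    answer = model_answer.strip()
    if answer == item.correct_answer:
        return "compliant"    # followed the modified instructions
    if answer == item.habitual_answer:
        return "rigid"        # defaulted to the ingrained solution path
    return "other_error"      # wrong for some unrelated reason

# Toy item: a familiar operation with one rule deliberately flipped.
item = DiagnosticItem(
    prompt="Compute 3 + 4, but treat '+' as multiplication for this problem.",
    correct_answer="12",
    habitual_answer="7",
)
print(diagnose(item, "7"))  # -> "rigid": the stated constraint was ignored
```

Keeping the habitual answer alongside the correct one is what separates rigidity from ordinary mistakes: only responses that land exactly on the familiar solution are attributed to instruction overriding, while other wrong answers are counted as generic errors.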
Problem

Research questions and friction points this paper is trying to address.

Models override user instructions due to reasoning rigidity
Rigidity causes incorrect conclusions in math and logic puzzles
Diagnostic set identifies three contamination patterns in reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-curated diagnostic dataset for rigidity
Modified AIME and MATH500 benchmarks
Categorized contamination into three modes