🤖 AI Summary
Large language models excel on standard mathematical benchmarks yet exhibit fragile reasoning when subjected to textual perturbations. This work introduces the Robust Reasoning Benchmark (RRB), which applies 14 deterministic perturbations to AIME 2024 problems, and evaluates 8 state-of-the-art models under this structural noise. Frontier models remain resilient, whereas open-weight reasoning models suffer average accuracy drops of up to 55% across perturbations, reaching 100% on some. The study also examines multi-problem sequential settings, in which several unperturbed problems are solved within one context window: open-weight models from 7B to 120B parameters and Claude Opus 4.6 all lose accuracy on later problems. The authors term this phenomenon “intra-query attention dilution” and present it as a fundamental limitation of dense attention mechanisms in chain-of-thought reasoning.
📝 Abstract
While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapses (average accuracy drops of up to 55% across perturbations, and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models' working memory capacity by forcing them to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open-weight models ranging from 7B to 120B parameters, as well as Claude Opus 4.6, exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that, to achieve reliable reasoning, future architectures must integrate explicit contextual resets within a model's own Chain-of-Thought, which raises fundamental open questions about the optimal granularity of atomic reasoning tasks.
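To make the sequential evaluation concrete, the following is a minimal sketch of how several unperturbed problems might be packed into one prompt and how accuracy could then be tabulated by problem position. The function names, prompt wording, and sample results are hypothetical and are not taken from the paper; the authors' actual prompt format and scoring may differ.

```python
from typing import List, Sequence


def build_sequential_prompt(problems: Sequence[str]) -> str:
    """Join several unperturbed problems into one prompt (hypothetical wording)."""
    header = "Solve the following problems in order. Give a final answer for each.\n\n"
    body = "\n\n".join(f"Problem {i}: {p}" for i, p in enumerate(problems, start=1))
    return header + body


def accuracy_by_position(results: Sequence[Sequence[bool]]) -> List[float]:
    """Return accuracy at each problem position.

    results[r][k] is True if the k-th problem of run r was answered correctly.
    All runs are assumed to contain the same number of problems.
    """
    n_runs = len(results)
    n_positions = len(results[0])
    return [sum(run[k] for run in results) / n_runs for k in range(n_positions)]


# Made-up outcomes for four runs of three problems each (illustration only).
example_results = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [True, False, False],
]
per_position = accuracy_by_position(example_results)
```

A downward trend across `per_position` is the kind of accuracy decay on subsequent problems that the abstract describes; a flat profile would indicate that earlier reasoning does not affect later problems.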