🤖 AI Summary
This work addresses the critical gap in evaluating the robustness of reasoning models under realistic noisy conditions—such as irrelevant documents, chat histories, and strong negative examples—where current benchmarks fall short. To this end, we introduce NoisyBench, the first systematic benchmark assessing model resilience across 11 tasks spanning RAG, reasoning, alignment, and tool use. Our analysis uncovers several counterintuitive findings: context noise can degrade performance by up to 80%, agent workflows amplify errors, and increased test-time computation may harm accuracy. We further propose Rationale-Aware Reward (RARE), a reinforcement learning method that steers models toward valid reasoning traces, significantly enhancing noise robustness. Experiments show that conventional approaches—including prompt engineering, supervised fine-tuning, and outcome-based reward RL—fail to improve resilience, whereas RARE effectively mitigates over-attention to distractor tokens, offering key insights for building robust reasoning agents.
📝 Abstract
Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. However, this shift introduces input contexts that are inherently noisy, a reality that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks against diverse noise types, including random documents, irrelevant chat histories, and hard negative distractors. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors. Crucially, we find that agentic workflows often amplify these errors by over-trusting noisy tool outputs, and that distractors can trigger emergent misalignment even without adversarial intent. Standard remedies, including prompting, context engineering, SFT, and outcome-reward-only RL, fail to ensure robustness; in contrast, our proposed Rationale-Aware Reward (RARE) significantly strengthens resilience by incentivizing the identification of helpful information within noise. Finally, we uncover an inverse scaling trend where increased test-time computation leads to worse performance in noisy settings, and demonstrate via attention visualization that models disproportionately focus on distractor tokens, providing vital insights for building the next generation of robust, reasoning-capable agents.
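The abstract does not spell out RARE's formulation, but the stated idea is to reward not just the final answer (as outcome-reward-only RL does) but also reasoning traces that identify helpful information within the noise. A minimal sketch of that idea, assuming a simple additive reward with a hypothetical grounding bonus (the function name, span-matching heuristic, and `bonus_weight` are illustrative assumptions, not the paper's actual implementation):

```python
# Hedged sketch of a rationale-aware reward: combine an outcome reward with a
# bonus for rationales grounded in gold evidence rather than distractor text.
# All names, the substring-matching heuristic, and the 0.5 weight are
# illustrative assumptions; the paper's RARE may differ.

def rationale_aware_reward(
    answer: str,
    gold_answer: str,
    rationale: str,
    gold_spans: list[str],
    distractor_spans: list[str],
    bonus_weight: float = 0.5,
) -> float:
    """Outcome correctness plus a bonus favoring rationales that cite
    helpful evidence over distractor content."""
    # Outcome-only RL would stop here.
    outcome = 1.0 if answer.strip() == gold_answer.strip() else 0.0

    # Crude grounding signal: which known spans does the rationale quote?
    cited_gold = sum(span in rationale for span in gold_spans)
    cited_noise = sum(span in rationale for span in distractor_spans)
    total = cited_gold + cited_noise
    # Fraction of cited evidence that is actually helpful; 0 if nothing cited.
    grounding = cited_gold / total if total else 0.0

    return outcome + bonus_weight * grounding
```

Under this sketch, a correct answer whose rationale quotes only gold evidence scores 1.5, while a correct answer that leans on distractor spans scores less, so the policy gradient pushes the model away from over-attending to distractor tokens even when it happens to answer correctly.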