🤖 AI Summary
This study systematically investigates the robustness of reasoning-oriented large language models (RLLMs) when their chain-of-thought (CoT) reasoning is perturbed. Using a controlled evaluation framework, the authors apply seven types of interventions (benign, neutral, and adversarial) to model-generated CoTs at fixed timesteps and assess recovery across mathematical, scientific, and logical reasoning tasks in multiple open-weight models. The work shows that CoT robustness depends on model scale, intervention timing, and expression style, identifying "doubt" as a central recovery mechanism. It also uncovers a trade-off between robustness and reasoning efficiency: early interventions are more disruptive; paraphrasing suppresses doubt-like expressions and reduces accuracy; and neutral or adversarial perturbations can inflate CoT length by more than 200%.
📝 Abstract
Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs) before giving an answer, which improves performance on complex tasks and makes reasoning more transparent. But how robust are these reasoning traces to disruptions that occur within them? To address this question, we introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps. We design seven interventions (benign, neutral, and adversarial) and apply them to multiple open-weight RLLMs across Math, Science, and Logic tasks. Our results show that RLLMs are generally robust, reliably recovering from diverse perturbations, with robustness improving with model size and degrading when interventions occur early. However, robustness is not style-invariant: paraphrasing suppresses doubt-like expressions and reduces performance, while other interventions trigger doubt and support recovery. Recovery also carries a cost: neutral and adversarial noise can inflate CoT length by more than 200%, whereas paraphrasing shortens traces but harms accuracy. These findings provide new evidence on how RLLMs maintain reasoning integrity, identify doubt as a central recovery mechanism, and highlight trade-offs between robustness and efficiency that future training methods should address.
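The evaluation setup described above (perturbing a model's own CoT at a fixed timestep, then letting it continue) can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual code: the intervention names, the step-level string edits, and the `perturb_cot` helper are all assumptions for the sake of the example.

```python
# Hypothetical sketch of a CoT-perturbation harness (all names are assumptions,
# not the paper's implementation). Each intervention edits one reasoning step.
INTERVENTIONS = {
    # benign: restate the step (crude stand-in for paraphrasing)
    "paraphrase": lambda step: f"In other words, {step.lower()}",
    # neutral: inject irrelevant noise
    "neutral_noise": lambda step: step + " Unrelatedly, the sky is blue.",
    # adversarial: inject a misleading claim the model must recover from
    "adversarial": lambda step: step + " Therefore the earlier result must be wrong.",
}

def perturb_cot(cot_steps, timestep, kind):
    """Apply one intervention at a fixed step index and return the perturbed
    prefix; in the full pipeline this prefix would be fed back to the model
    so it continues reasoning from the altered trace."""
    if not 0 <= timestep < len(cot_steps):
        raise IndexError("timestep outside the CoT")
    prefix = cot_steps[: timestep + 1]          # truncate at the intervention point
    prefix[timestep] = INTERVENTIONS[kind](prefix[timestep])
    return prefix
```

Measuring recovery then amounts to comparing final-answer accuracy and trace length between perturbed and unperturbed continuations, varying `timestep` to probe early versus late interventions.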