🤖 AI Summary
This paper identifies “Reasoning Theater Bias” (RTB) in large reasoning models (LRMs): an overreliance on verbose, formalistic, but semantically irrelevant reasoning artifacts during automated evaluation, which particularly undermines validity on subjective tasks. Contrary to prior assumptions, reasoning-specialized models are *more* susceptible to RTB than general-purpose LLMs, with “shallow reasoning” emerging as the most pervasive form of the bias. To study RTB systematically, the authors introduce THEATER, a benchmark comprising six bias categories (e.g., Fake Chain-of-Thought, Simple Cues), a task-dependent bias analysis framework, and mitigation strategies, including a targeted system prompt and a self-reflection mechanism. Experiments show these mitigations improve accuracy by up to 12% on factual tasks but only 1–3% on subjective ones, confirming RTB as a deep, task-sensitive challenge intrinsic to current LRM evaluation paradigms.
📝 Abstract
Large Reasoning Models (LRMs) like DeepSeek-R1 and o1 are increasingly used as automated evaluators, raising critical questions about their vulnerability to the aesthetics of reasoning in LLM-as-a-judge settings. We introduce THEATER, a comprehensive benchmark to systematically evaluate this vulnerability, termed Reasoning Theater Bias (RTB), by comparing LLMs and LRMs across subjective preference and objective factual datasets. Through an investigation of six bias types, including Simple Cues and Fake Chain-of-Thought, we uncover three key findings: (1) in a critical paradox, reasoning-specialized LRMs are consistently more susceptible to RTB than general-purpose LLMs, particularly on subjective tasks; (2) this creates a task-dependent trade-off, in which LRMs are more robust on factual tasks than on subjective ones; and (3) “shallow reasoning” (plausible but flawed arguments) is the most potent form of RTB. To address this, we design and evaluate two prompting strategies: a targeted system prompt that improves accuracy by up to 12% on factual tasks but only 1–3% on subjective tasks, and a self-reflection mechanism that shows similarly limited effectiveness in the more vulnerable subjective domains. Our work reveals that RTB is a deep-seated challenge for LRM-based evaluation and provides a systematic framework for developing more genuinely robust and trustworthy LRMs.