🤖 AI Summary
This paper identifies “Reasoning Theater Bias” (RTB) in large reasoning models (LRMs): an overreliance on verbose, formalistic, but semantically irrelevant reasoning artifacts during automated evaluation, which particularly undermines validity on subjective tasks. Contrary to prior assumptions, reasoning-specialized models are *more* susceptible to RTB than general-purpose LLMs, with “shallow reasoning” emerging as the most pervasive form of the bias. To study RTB systematically, the authors introduce THEATER, a benchmark comprising six bias categories (e.g., Fake Chain-of-Thought, Simple Cues), a task-dependent bias analysis framework, and mitigation strategies, including a targeted system prompt and a self-reflection mechanism. Experiments show these mitigations improve accuracy by up to 12% on factual tasks but only 1–3% on subjective ones, confirming RTB as a deep, task-sensitive challenge intrinsic to current LRM evaluation paradigms.
📝 Abstract
Large Reasoning Models (LRMs) like DeepSeek-R1 and o1 are increasingly used as automated evaluators, raising critical questions about their vulnerability to the aesthetics of reasoning in LLM-as-a-judge settings. We introduce THEATER, a comprehensive benchmark to systematically evaluate this vulnerability, termed Reasoning Theater Bias (RTB), by comparing LLMs and LRMs across subjective preference and objective factual datasets. Through an investigation of six bias types, including Simple Cues and Fake Chain-of-Thought, we uncover three key findings: (1) in a critical paradox, reasoning-specialized LRMs are consistently more susceptible to RTB than general-purpose LLMs, particularly on subjective tasks; (2) this creates a task-dependent trade-off, in which LRMs are more robust on factual tasks than on subjective ones; and (3) “shallow reasoning” (plausible but flawed arguments) is the most potent form of RTB. To address this, we design and evaluate two prompting strategies: a targeted system prompt that improves accuracy by up to 12% on factual tasks but only 1–3% on subjective tasks, and a self-reflection mechanism that shows similarly limited effectiveness in the more vulnerable subjective domains. Our work reveals that RTB is a deep-seated challenge for LRM-based evaluation and provides a systematic framework for developing more genuinely robust and trustworthy LRMs.