🤖 AI Summary
This work investigates the impact of reasoning on precision-critical safety tasks, such as safety detection and hallucination detection, in large language models (LLMs) under strict low false positive rate (FPR) constraints. We systematically compare “Think On” (reasoning enabled) and “Think Off” (reasoning disabled) modes across fine-tuned and zero-shot classification settings, introducing token-level scoring and dual-mode ensemble strategies. Our key findings are threefold: (1) in the critical low-FPR regime, Think Off significantly outperforms Think On, despite reasoning’s overall accuracy gains, because reasoning induces severe recall degradation; (2) quantitative token-level confidence scores substantially outperform self-reported model confidence; and (3) a simple ensemble of Think On and Think Off achieves a superior precision-recall trade-off. This study establishes a more robust, deployable detection paradigm for safety-sensitive applications.
📝 Abstract
Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks, safety detection and hallucination detection, evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On pulling ahead only when higher FPRs are acceptable. In addition, we find that token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.
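To make the two scoring ideas concrete, here is a minimal sketch of token-level confidence scoring and a dual-mode ensemble. It assumes the model exposes log-probabilities for candidate label tokens (e.g., "safe" vs. "unsafe") at the decision position; the function names, the toy log-probabilities, and the equal-weight convex combination are illustrative assumptions, not the paper's exact formulation.

```python
import math

def token_confidence(label_logprobs):
    """Token-level score: normalized probability mass on the 'unsafe'
    label token relative to all candidate label tokens.
    `label_logprobs` maps each candidate label token to the
    log-probability the model assigned it (hypothetical interface)."""
    probs = {tok: math.exp(lp) for tok, lp in label_logprobs.items()}
    return probs["unsafe"] / sum(probs.values())

def ensemble_score(score_on, score_off, weight=0.5):
    """Dual-mode ensemble: a convex combination of the Think On and
    Think Off token-level scores (the combination rule is an
    assumption; the paper only describes a simple ensemble)."""
    return weight * score_on + (1.0 - weight) * score_off

# Toy log-probabilities from the two inference modes (made up).
think_on_logprobs = {"unsafe": -0.9, "safe": -0.5}
think_off_logprobs = {"unsafe": -0.2, "safe": -1.8}

s_on = token_confidence(think_on_logprobs)
s_off = token_confidence(think_off_logprobs)
combined = ensemble_score(s_on, s_off)

# A deployment would then threshold `combined` at the operating point
# chosen to satisfy the target FPR on a validation set.
print(f"on={s_on:.3f} off={s_off:.3f} ensemble={combined:.3f}")
```

The key contrast with self-verbalized confidence is that these scores come from the model's actual output distribution, so they can be thresholded continuously to hit a target FPR rather than relying on the model's own coarse self-report.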