First Try Matters: Revisiting the Role of Reflection in Reasoning Models

📅 2025-10-09

🤖 AI Summary

This work investigates the practical role of “reflection” in large language model (LLM) reasoning, analyzing rollouts from eight reasoning models on five mathematical datasets. We find that reflection predominantly confirms initial answers rather than correcting errors, and that performance gains stem primarily from improved first-answer accuracy, not post-hoc correction. To study the role of reflection in training, we construct supervised fine-tuning (SFT) datasets with varying numbers of reflection steps; training on rollouts with more reflection mainly improves first-answer correctness. To address reflection redundancy, we propose a question-aware early-stopping method that halts reasoning once a few plausible candidate answers have been generated, and we dynamically truncate reflections after a candidate answer appears. Across five mathematical datasets, this reduces reasoning tokens by 24.5% with an accuracy drop within 2.9%. Our core contributions are: (1) an empirical characterization showing that reflection mostly confirms answers rather than correcting errors; and (2) a dynamic truncation approach designed to reduce redundant reflection in LLM reasoning.

📝 Abstract

Large language models have recently demonstrated significant gains in reasoning ability, often attributed to their capacity to generate longer chains of thought and engage in reflective reasoning. However, the contribution of reflections to performance improvement remains unclear. In this paper, we systematically analyze the rollouts of eight reasoning models on five mathematical datasets. We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model's initial answer, a pattern consistent across models and datasets. To understand the role of reflections in training, we construct supervised fine-tuning (SFT) datasets with varying amounts of reflection steps. We observe that training models on rollouts with more reflection steps primarily enhances first-answer correctness rather than the ability to correct initially wrong answers through reflections. This motivates us to propose a question-aware early-stopping method that enhances inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers are generated, thereby reducing unnecessary reflection steps. Motivated by this, we further propose to dynamically truncate the reflections after a candidate answer has appeared during generation, which reduces reasoning tokens by 24.5% across five mathematical datasets, within a 2.9% drop in accuracy.

Problem

Research questions and friction points this paper is trying to address.

Analyzing the limited effectiveness of reflection in improving reasoning model accuracy

Investigating how reflection training primarily enhances first-answer correctness

Developing methods to reduce unnecessary reflection steps while maintaining accuracy
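The distinction between confirming and correcting an answer can be made concrete with a small labeling routine. The sketch below is illustrative only and is not taken from the paper: it assumes the candidate answers of one rollout have already been extracted in order of appearance, compares them by exact string match, and uses hypothetical label names.

```python
from typing import List


def classify_rollout(candidates: List[str], reference: str) -> str:
    """Label one rollout by how reflection changed the answer.

    `candidates` holds the answers in the order they appeared;
    `reference` is the ground-truth answer. Label names are hypothetical.
    """
    if not candidates:
        return "no_answer"
    first_ok = candidates[0] == reference   # first-answer correctness
    final_ok = candidates[-1] == reference  # correctness after reflection
    if first_ok and final_ok:
        return "confirmed_correct"  # reflection only confirmed the answer
    if not first_ok and final_ok:
        return "corrected"          # reflection fixed a wrong first answer
    if first_ok and not final_ok:
        return "broken"             # reflection replaced a correct answer
    return "wrong_to_wrong"         # still wrong after reflection


print(classify_rollout(["7", "7", "7"], "7"))
print(classify_rollout(["5", "7"], "7"))
```

Counting these labels over many rollouts gives the share of reflections that are confirmatory versus corrective, which is the kind of statistic the analysis above describes.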

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically analyzes how reflection affects answers across eight reasoning models and five mathematical datasets

Proposes question-aware early-stopping for token efficiency

Dynamically truncates reflections after candidate answer generation
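The truncation idea can be sketched as a loop that stops consuming reasoning steps once enough candidate answers have appeared. This is a minimal sketch under stated assumptions, not the paper's implementation: the regular expression, the function name `truncate_after_candidates`, and the fixed `max_candidates` threshold are all hypothetical, and the question-aware part of the method (choosing when to stop per question) is not modeled here.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical detector: treats "answer is <number>" as a candidate answer.
CANDIDATE = re.compile(r"answer is\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def truncate_after_candidates(
    steps: Iterable[str], max_candidates: int = 1
) -> Tuple[List[str], List[str]]:
    """Keep reasoning steps until `max_candidates` answers have appeared."""
    kept: List[str] = []
    candidates: List[str] = []
    for step in steps:
        kept.append(step)
        candidates.extend(CANDIDATE.findall(step))
        if len(candidates) >= max_candidates:
            break  # drop all remaining reflection steps
    return kept, candidates


steps = [
    "Let x = 3 + 4.",
    "So the answer is 7.",
    "Wait, let me double-check: 3 + 4 = 7.",
    "Yes, the answer is 7.",
]
kept, candidates = truncate_after_candidates(steps, max_candidates=1)
print(len(kept), candidates)
```

With `max_candidates=1`, generation stops right after the first candidate and the two reflection steps are dropped; raising the threshold keeps more reflection at the cost of more tokens.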
