Large Language Models Decide Early and Explain Later

📅 2026-04-24

🤖 AI Summary
This work addresses the computational overhead of chain-of-thought reasoning in large language models, which often keep generating intermediate steps after the answer has already been determined. The authors use forced answer completion to track how the predicted answer evolves across partial reasoning prefixes, and find that models frequently settle on their final answer well before the reasoning trace ends. Building on this, they study early-stopping heuristics, including probe-based stopping, that halt generation once the answer has stabilized. Experimental results on models such as Qwen3-4B show that this approach reduces reasoning token usage by approximately 500 tokens per query, with only a 2% drop in accuracy. These findings indicate that a large portion of chain-of-thought generation is redundant and can be removed with minimal impact on performance.

📝 Abstract
Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
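
The stability idea described in the abstract can be illustrated with a minimal sketch. It assumes the intermediate answers have already been elicited by forced answer completion at a series of reasoning-prefix checkpoints. The function names, the `patience` rule, and the forcing suffix are hypothetical illustrations and are not the paper's heuristics; in particular, this is not the probe-based stopping method, whose details are not given here.

```python
from typing import Optional, Sequence


def forced_answer_prompt(question: str, reasoning_prefix: str,
                         suffix: str = "\n</think>\n\nFinal answer: ") -> str:
    """Build a prompt that cuts reasoning short and asks for an answer.

    The suffix is an assumed forcing string, not the one used in the paper.
    """
    return f"{question}\n<think>\n{reasoning_prefix}{suffix}"


def tokens_after_last_switch(answers: Sequence[str],
                             checkpoints: Sequence[int]) -> int:
    """Count reasoning tokens generated after the answer last changed.

    answers[i] is the forced answer at a prefix of checkpoints[i] tokens.
    If the answer never changes, everything after the first checkpoint counts.
    """
    if not answers or len(answers) != len(checkpoints):
        raise ValueError("answers and checkpoints must be non-empty and aligned")
    last_switch = 0
    for i in range(1, len(answers)):
        if answers[i] != answers[i - 1]:
            last_switch = i
    return checkpoints[-1] - checkpoints[last_switch]


def first_stable_index(answers: Sequence[str], patience: int = 3) -> Optional[int]:
    """Return the first checkpoint index at which the same answer has been
    seen `patience` times in a row, or None if that never happens."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    run = 0
    for i, answer in enumerate(answers):
        run = run + 1 if i > 0 and answer == answers[i - 1] else 1
        if run >= patience:
            return i
    return None


# Toy trace: forced answers sampled every 100 reasoning tokens.
answers = ["12", "15", "15", "15", "15"]
checkpoints = [100, 200, 300, 400, 500]

stop = first_stable_index(answers, patience=3)       # index 3
print(checkpoints[stop])                              # 400
print(tokens_after_last_switch(answers, checkpoints)) # 300
```

In this toy trace the answer last switches at 200 tokens, so 300 of the 500 reasoning tokens come after the decision, and a patience of 3 would stop generation at 400 tokens. A larger `patience` trades token savings for a lower risk of stopping before a late answer switch.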

Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Chain-of-Thought Reasoning
Early Answer Determination
Redundant Generation
Inference Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

early stopping
chain-of-thought reasoning
forced answer completion
answer stabilization
inference efficiency