🤖 AI Summary
Large language models (LLMs) rely on long-chain reasoning for complex tasks, yet their autoregressive, left-to-right generation is vulnerable to error propagation from early incorrect tokens. Existing self-reflection methods, such as full-draft revision or costly fine-tuning, are reactive and inefficient. This paper introduces SRGen, a lightweight, test-time framework enabling fine-grained, on-the-fly self-reflection during generation: it dynamically identifies high-uncertainty tokens via entropy-based thresholds and performs localized probability distribution correction conditioned on already-generated context, requiring neither full-draft re-generation nor model fine-tuning. SRGen is plug-and-play, computationally efficient, and agnostic to underlying training or inference strategies. Evaluated on mathematical reasoning benchmarks including AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen achieves absolute improvements of +12.0% in Pass@1 and +13.3% in Cons@5, substantially enhancing both single-sample output quality and self-consistency.
📝 Abstract
Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile: early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both of which are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen uses dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a token-specific corrective vector that fully exploits the already generated context to correct the token's probability distribution through self-reflective generation. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen consistently strengthens model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. In particular, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
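To make the two core ideas concrete, here is a minimal, illustrative sketch of (a) entropy-based detection of high-uncertainty tokens via a dynamic threshold, and (b) a test-time "corrective vector" optimized by gradient descent to sharpen the next-token distribution. This is not the paper's actual objective (SRGen's corrective vector is trained on hidden states conditioned on the generated context); the threshold rule (mean plus a multiple of the standard deviation of recent entropies), the entropy-minimization loss, and all function names here are simplifying assumptions for illustration only.

```python
import numpy as np

def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -(p * np.log(p + 1e-12)).sum()

def is_uncertain(logits, recent_entropies, k=1.5):
    """Hypothetical dynamic threshold: flag a token whose entropy exceeds
    mean + k * std of recently observed token entropies."""
    h = entropy(logits)
    thr = np.mean(recent_entropies) + k * np.std(recent_entropies)
    return h > thr, h

def corrective_vector(logits, steps=50, lr=0.5, lam=0.01):
    """Toy test-time corrective vector: an additive logit offset `delta`
    optimized to reduce the entropy of the corrected distribution, with an
    L2 penalty keeping the correction small. (A stand-in loss; SRGen's
    loss is context-conditioned, which this sketch omits.)"""
    delta = np.zeros_like(logits)
    for _ in range(steps):
        z = logits + delta
        zs = z - z.max()
        p = np.exp(zs) / np.exp(zs).sum()
        h = -(p * np.log(p + 1e-12)).sum()
        grad_h = -p * (np.log(p + 1e-12) + h)    # analytic dH/dz
        delta -= lr * (grad_h + 2 * lam * delta)  # minimize H + lam*||delta||^2
    return delta

# Usage: a flat-ish distribution is flagged, then sharpened in place.
logits = np.array([1.0, 0.9, 0.8, 0.2, 0.1])
flagged, h0 = is_uncertain(logits, recent_entropies=[0.5, 0.6, 0.4])
h1 = entropy(logits + corrective_vector(logits))
```

The key property this mirrors from the paper is locality: only the flagged position's distribution is corrected, with no re-generation of earlier tokens and no update to model weights.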