Stateful Reasoning via Insight Replay

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation observed in long chain-of-thought (CoT) reasoning, where language models increasingly lose focus on critical early insights as the reasoning sequence lengthens. To mitigate this decay, the authors propose InsightReplay, a stateful reasoning mechanism that periodically extracts key intermediate insights and replays them near the generation frontier. By integrating attention analysis, salient information extraction, and contextual replay within large language models, InsightReplay counteracts forgetting during extended inference. Across 24 experimental settings, 3-round InsightReplay improves accuracy in every setting, with an average gain of 1.65 points over standard CoT and a largest single-setting gain of 9.2 points.

📝 Abstract

Chain-of-Thought (CoT) reasoning has become a foundation for eliciting multi-step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length. Longer CoT generally enables a model to tackle harder problems, yet on a given problem accuracy typically increases with CoT length up to a point and then declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they are most needed. To address this, we propose InsightReplay, a stateful reasoning approach in which the model periodically extracts critical insights from its reasoning trace and replays them near the active generation frontier, keeping them accessible as the reasoning scales. Extensive experiments on a $2 \times 3 \times 4$ benchmark grid, covering model scales {8B, 30B}, model families {Qwen3.5, DeepSeek-R1-Distill-Qwen, Gemma-4}, and reasoning benchmarks {AIME, HMMT, GPQA Diamond, LiveCodeBench v5}, show that 3-round InsightReplay yields accuracy gains across all 24 settings, with an average improvement of +1.65 points over standard CoT and a largest single-setting gain of +9.2 points on R1-Distill-32B's LiveCodeBench v5 subset. Our results suggest that the effectiveness of test-time scaling depends not only on how much a model reasons, but also on whether critical intermediate insights remain accessible throughout long reasoning trajectories.
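
The extract-and-replay loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how insights are extracted, how a round is delimited, or how the replay block is formatted. The sketch assumes that a round is one generation segment followed by one extraction pass, and every name in it (`insight_replay`, `generate`, `extract_insights`, the `[Key insights so far]` marker, and the toy stand-ins) is hypothetical.

```python
from typing import Callable, List

MARKER = "[Key insights so far]"  # hypothetical replay header, not from the paper


def insight_replay(
    problem: str,
    generate: Callable[[str], str],
    extract_insights: Callable[[str], List[str]],
    rounds: int = 3,
) -> str:
    """Generate a trace in `rounds` segments, replaying insights before each new one."""
    trace = ""
    insights: List[str] = []  # the carried "state" across rounds
    for _ in range(rounds):
        # Place the current insights right at the generation frontier.
        replay = ""
        if insights:
            bullets = "\n".join(f"- {s}" for s in insights)
            replay = f"\n{MARKER}\n{bullets}\n"
        segment = generate(problem + "\n" + trace + replay)
        trace += replay + segment
        # Refresh the state from the full trace, skipping duplicates.
        for s in extract_insights(trace):
            if s not in insights:
                insights.append(s)
    return trace


# Toy stand-ins for a model call and an extractor (assumptions for illustration).
def toy_generate(prompt: str) -> str:
    n = prompt.count(MARKER)  # how many replay blocks the "model" has seen
    return f"Insight: fact {n}\nstep {n}\n"


def toy_extract(trace: str) -> List[str]:
    prefix = "Insight: "
    return [ln[len(prefix):] for ln in trace.splitlines() if ln.startswith(prefix)]


if __name__ == "__main__":
    print(insight_replay("Solve x", toy_generate, toy_extract, rounds=3))
```

In a real setting `generate` would wrap a language model call with a per-segment token budget and `extract_insights` would likely be a second prompt to the same model. Both are left as plain callables here so the control flow stays visible.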

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought

reasoning

insight accessibility

stateful reasoning

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

InsightReplay

Chain-of-Thought

stateful reasoning