🤖 AI Summary
Reasoning models (e.g., DeepSeek-R1) exhibit pathological reasoning loops under low-temperature or greedy decoding when generating chain-of-thought (CoT) sequences, impeding complex problem solving.
Method: We identify two root causes: risk-averse behavior induced by the hardness of learning, and a Transformer inductive bias toward temporally correlated errors. To isolate these factors, we introduce the concept of "errors in learning" and design a synthetic graph reasoning task that decouples problem difficulty from inductive bias. We then analyze loop formation via probabilistic trajectory tracking, controlled decoding experiments, and knowledge distillation comparisons.
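The paper's exact construction is not reproduced here; as a minimal sketch under assumptions, the snippet below generates ground-truth walks for a graph task of this flavor, in which every node has one hard-to-learn progress edge (a scrambled successor function) and one trivially predictable cyclic edge back to the previous node. All names and the successor function are illustrative, not the paper's setup.

```python
# Hypothetical sketch of a synthetic graph-walk dataset in the spirit of
# the paper's task: from every node, the correct "progress" edge leads to
# a hard-to-predict successor, while an easy "cyclic" edge simply returns
# to the previous node. The scrambled successor function and all names are
# illustrative assumptions, not the paper's construction.
import random

N_NODES = 64

def progress_target(node: int) -> int:
    # Deterministic but scrambled successor: hard for a model to learn,
    # and decoupled from any simple positional pattern.
    return (node * 37 + 11) % N_NODES

def ground_truth_walk(start: int, steps: int) -> list[int]:
    # The reference chain of thought always takes the progress edge; a
    # model that under-learns progress_target can fall back on the easy
    # cyclic edge (revisiting the previous node) and get stuck.
    walk = [start]
    for _ in range(steps):
        walk.append(progress_target(walk[-1]))
    return walk

def make_dataset(n_examples: int, steps: int = 8) -> list[list[int]]:
    return [ground_truth_walk(random.randrange(N_NODES), steps)
            for _ in range(n_examples)]

if __name__ == "__main__":
    for walk in make_dataset(3):
        print(" -> ".join(map(str, walk)))
```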
Contribution/Results: Empirically, larger models loop less often, whereas knowledge distillation significantly exacerbates looping. Crucially, increasing temperature only broadens exploration without repairing the underlying errors in learning. Our work provides both a mechanistic account of loop formation and practical guidance for training paradigms and inference strategies in reasoning-capable LLMs.
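To make the temperature claim concrete, consider a toy two-action step with invented logits: if training has left the easy cyclic action with the larger logit, greedy decoding loops deterministically, and raising the temperature only flattens the softmax toward 50/50 without ever making the correct action the argmax.

```python
import math

def softmax2(logit_a: float, logit_b: float, temp: float) -> float:
    """P(action a) under a temperature-scaled softmax over two actions."""
    a = math.exp(logit_a / temp)
    b = math.exp(logit_b / temp)
    return a / (a + b)

# Invented logits: the model has mislearned this step, so the easy cyclic
# action (logit 1.5) outscores the correct progress action (logit 1.0).
for temp in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"T={temp:>4}: P(progress) = {softmax2(1.0, 1.5, temp):.3f}")
# P(progress) climbs from ~0.007 toward 0.5 but never crosses it:
# temperature raises the per-step chance of escaping the loop, yet the
# greedy choice remains the cyclic action at every temperature.
```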
📝 Abstract
Reasoning models (e.g., DeepSeek-R1) generate long chains of thought to solve harder problems, but they often loop, repeating the same text at low temperatures or with greedy decoding. We study why this happens and what role temperature plays. With open reasoning models, we find that looping is common at low temperature. Larger models tend to loop less, and distilled students loop frequently even when their teachers rarely do. This points to mismatches between the training distribution and the learned model, which we refer to as errors in learning, as a key cause. To understand how such errors cause loops, we introduce a synthetic graph reasoning task and demonstrate two mechanisms. First, risk aversion caused by hardness of learning: when the correct progress-making action is hard to learn but an easy cyclic action is available, the model puts relatively more probability on the cyclic action and gets stuck. Second, even when there is no hardness, Transformers show an inductive bias toward temporally correlated errors, so the same few actions keep being chosen and loops appear. Higher temperature reduces looping by promoting exploration, but it does not fix the errors in learning, so generations remain much longer than necessary even at high temperature; in this sense, temperature is a stopgap rather than a complete solution. We end with a discussion of training-time interventions aimed at directly reducing errors in learning.
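As a back-of-the-envelope illustration of the stopgap point (all numbers are assumptions, not the paper's experiments), the sketch below simulates traversing a 10-node chain where every step carries a mislearned preference for a self-loop. Near-greedy decoding takes hundreds of steps, while even generous temperatures finish at roughly twice the 10-step minimum: exploration escapes loops but cannot recover the efficiency that correct learning would provide.

```python
import math
import random

def p_progress(temp: float, logit_progress=1.0, logit_cyclic=1.5) -> float:
    # Per-step probability of the correct action under a temperature-scaled
    # softmax; the logits are invented to encode a mislearned preference
    # for the cyclic action.
    a = math.exp(logit_progress / temp)
    return a / (a + math.exp(logit_cyclic / temp))

def walk_length(temp: float, chain_len=10, max_steps=10_000) -> int:
    # Steps needed to traverse the chain when every step risks a self-loop.
    pos, steps, p = 0, 0, p_progress(temp)
    while pos < chain_len and steps < max_steps:
        steps += 1
        if random.random() < p:
            pos += 1      # progress edge: one node closer to the answer
        # else: cyclic edge, stay in place (a loop step)
    return steps

for temp in (0.1, 0.5, 1.0, 2.0):
    mean = sum(walk_length(temp) for _ in range(500)) / 500
    print(f"T={temp}: mean steps = {mean:.0f}  (minimum possible: 10)")
```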