Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models struggle to perform implicit multi-hop reasoning in a single forward pass and show limited compositional generalization over their parametric knowledge. This work studies recurrent-depth transformers, which iterate over the same transformer layers, to improve systematic generalization and depth extrapolation. The study reveals that systematic generalization emerges through a three-stage grokking process, and that increasing the number of recurrence steps at inference time unlocks reasoning performance beyond the training depth. The authors also identify "overthinking," where excessive recurrence degrades predictions, as a key bottleneck limiting further gains. Experimental results show that, compared to standard transformers, recurrent-depth models generalize substantially better to unseen knowledge compositions and longer reasoning chains.
📝 Abstract
We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively achieve such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
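The core mechanism the abstract describes, applying one set of transformer-layer weights repeatedly so that depth becomes an inference-time knob, can be sketched as follows. This is a hypothetical toy in NumPy, not the authors' implementation; all names, shapes, and the single-block design are assumptions for illustration.

```python
# Toy sketch of a recurrent-depth transformer forward pass (NumPy).
# Hypothetical illustration: one shared block is applied `n_iter` times,
# so the effective depth is chosen at inference time, not fixed by the
# number of distinct layers.
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 5  # model width, sequence length (toy sizes)

# A single set of block weights, reused at every recurrence step.
Wq, Wk, Wv = (rng.normal(0, d**-0.5, (d, d)) for _ in range(3))
W1 = rng.normal(0, d**-0.5, (d, 4 * d))
W2 = rng.normal(0, (4 * d) ** -0.5, (4 * d, d))

def layer_norm(h):
    return (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-5)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def shared_block(h):
    # Pre-norm self-attention with a residual connection.
    x = layer_norm(h)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    h = h + softmax(q @ k.T / np.sqrt(d)) @ v
    # Pre-norm feed-forward with a residual connection.
    x = layer_norm(h)
    return h + np.maximum(x @ W1, 0.0) @ W2

def forward(h0, n_iter):
    # Recurrent depth: the SAME block (same weights) runs n_iter times.
    h = h0
    for _ in range(n_iter):
        h = shared_block(h)
    return layer_norm(h)

h0 = rng.normal(size=(T, d))
shallow = forward(h0, n_iter=5)   # e.g. the depth seen during training
deep = forward(h0, n_iter=10)     # extra recurrence at inference time
```

Per the abstract, scaling `n_iter` at inference is what unlocks depth extrapolation (more iterations, deeper reasoning), while pushing it too far, "overthinking", can degrade predictions.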
Problem

Research questions and friction points this paper is trying to address.

implicit reasoning
compositional generalization
systematic generalization
depth extrapolation
multi-hop reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

recurrent-depth transformers
implicit reasoning
compositional generalization
depth extrapolation
grokking