Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of standard Transformers in generalizing to tasks requiring variable-depth reasoning—such as multi-hop graph traversal and nested logical inference—due to their fixed computational depth. The authors propose the Depth-Recurrent Transformer, which decouples computational depth from parameter count by iteratively reusing a shared-weight Transformer block in latent space, enabling on-demand adjustment of reasoning steps at inference time. Through silent-thinking supervision, LayerScale initialization, and an identity-biased recurrence mechanism, the model achieves stable operation over 20+ recurrent steps for the first time, revealing a “computational frontier” between task complexity and the number of thinking steps required. Experiments on graph reachability, nested Boolean logic, and unstructured relational reasoning demonstrate consistent gains with increased reasoning depth, and reveal qualitatively different generalization behaviors across domains, from precise but brittle to approximate but robust to autonomous latent routing.

📝 Abstract
Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space, enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested Boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear “computational frontier”: a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
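The recurrence the abstract describes can be illustrated with a minimal, framework-free sketch. This is not the paper's code: `shared_block` is a hypothetical stand-in for the shared-weight Transformer block, and the numbers are toy values chosen only to show how an identity-biased update with a near-zero LayerScale keeps the latent state stable over 20+ steps.

```python
def shared_block(h):
    # Hypothetical placeholder for the shared-weight Transformer block;
    # a fixed nonlinearity so the sketch runs without any ML framework.
    return [x * x - 0.1 for x in h]

def depth_recurrent(h, steps, layer_scale=1e-4):
    # Identity-biased recurrence: h <- h + layer_scale * f(h).
    # The identity path acts as a gradient highway across steps, and
    # the near-zero layer_scale (LayerScale-style initialization) means
    # untrained steps barely perturb the reasoning state.
    for _ in range(steps):
        update = shared_block(h)
        h = [hi + layer_scale * ui for hi, ui in zip(h, update)]
    return h

state = [0.5, -0.2, 0.1]
deep = depth_recurrent(state, steps=20)
# With layer_scale near zero, 20 recurrent steps leave the state close
# to its starting point instead of diverging.
print(max(abs(a - b) for a, b in zip(deep, state)))
```

In practice `layer_scale` would be a learnable per-channel parameter that grows during training, so the model earns its effective depth rather than starting with 20 destabilizing transformations stacked on top of each other.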
Problem

Research questions and friction points this paper is trying to address.

compositional generalization
computational depth
reasoning
out-of-distribution generalization
variable-depth reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-Recurrent Transformers
Compositional Generalization
Silent Thinking Objective
LayerScale Initialization
Gradient Highway