Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of standard Transformers in generalizing to tasks requiring variable-depth reasoning—such as multi-hop graph traversal and nested logical inference—due to their fixed computational depth. The authors propose the Depth-Recurrent Transformer, which decouples computational depth from parameter count by iteratively reusing a shared-weight Transformer block in latent space, enabling on-demand adjustment of reasoning steps at inference time. Through silent-thinking supervision, LayerScale initialization, and an identity-biased recurrence mechanism, the model achieves stable operation over 20+ recurrent steps for the first time, revealing a “computational frontier” between task complexity and the number of thinking steps required. Experiments on graph reachability, nested Boolean logic, and unstructured relational reasoning demonstrate consistent gains with increased reasoning depth, and reveal qualitatively different generalization behaviors across domains, from precise but brittle to approximate but robust to autonomous latent routing.

📝 Abstract
Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space, enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested Boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear “computational frontier”: a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
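The recurrence the abstract describes can be illustrated with a minimal, framework-free sketch. This is not the paper's code: `shared_block` is a hypothetical stand-in for the shared-weight Transformer block, and the numbers are toy values chosen only to show how an identity-biased update with a near-zero LayerScale keeps the latent state stable over 20+ steps.

```python
def shared_block(h):
    # Hypothetical placeholder for the shared-weight Transformer block;
    # a fixed nonlinearity so the sketch runs without any ML framework.
    return [x * x - 0.1 for x in h]

def depth_recurrent(h, steps, layer_scale=1e-4):
    # Identity-biased recurrence: h <- h + layer_scale * f(h).
    # The identity path acts as a gradient highway across steps, and
    # the near-zero layer_scale (LayerScale-style initialization) means
    # untrained steps barely perturb the reasoning state.
    for _ in range(steps):
        update = shared_block(h)
        h = [hi + layer_scale * ui for hi, ui in zip(h, update)]
    return h

state = [0.5, -0.2, 0.1]
deep = depth_recurrent(state, steps=20)
# With layer_scale near zero, 20 recurrent steps leave the state close
# to its starting point instead of diverging.
print(max(abs(a - b) for a, b in zip(deep, state)))
```

In practice `layer_scale` would be a learnable per-channel parameter that grows during training, so the model earns its effective depth rather than starting with 20 destabilizing transformations stacked on top of each other.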
Problem

Research questions and friction points this paper is trying to address.

compositional generalization
computational depth
reasoning
out-of-distribution generalization
variable-depth reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-Recurrent Transformers
Compositional Generalization
Silent Thinking Objective
LayerScale Initialization
Gradient Highway