Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves

📅 2026-01-29
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing depth-recurrent models are constrained by fixed hidden dimensions and rigid stacked architectures, and prior work lacks baselines matched across FLOPs, parameters, and memory, hindering efficient multi-step implicit reasoning. This work proposes Dreamer, a modular framework that, for the first time, integrates depth attention with a sparse mixture-of-experts (MoE) mechanism. By introducing attention along the depth dimension, Dreamer decouples model scaling factors and overcomes the constant-hidden-size limitation. The approach substantially enhances expert selection diversity (by 2–11×) and knowledge utilization efficiency. On language reasoning benchmarks, Dreamer reaches comparable accuracy with 2–8× fewer training tokens and outperforms current state-of-the-art models roughly twice its size under the same training budget.

📝 Abstract
Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of a constant hidden size that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens to reach the same accuracy as FLOP-, parameter-, and memory-matched SOTA models, and outperform ca. 2x larger SOTA models trained on the same number of tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x larger expert selection diversity than SOTA MoEs.
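The abstract's central mechanism is attention along the depth dimension: instead of each recurrence step reading only the single fixed-size hidden state from the previous step, the current step can attend over the states produced at all earlier depths. The paper page does not include implementation details, so the following is only an illustrative NumPy sketch under standard scaled dot-product attention; the function and weight names (`depth_attention`, `w_q`, `w_k`, `w_v`) are hypothetical, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depth_attention(depth_states, w_q, w_k, w_v):
    """Attend from the current depth's state over all depth states so far.

    depth_states: (n_depths, d) hidden states produced at successive
    recurrence steps for one token position (hypothetical layout).
    """
    q = depth_states[-1] @ w_q                 # query from the current depth
    k = depth_states @ w_k                     # keys over all depths
    v = depth_states @ w_v                     # values over all depths
    scores = k @ q / np.sqrt(q.shape[-1])      # (n_depths,) similarities
    return softmax(scores) @ v                 # depth-aggregated context, (d,)

# Toy usage: 4 recurrence steps, hidden size 8, weights shared across depth.
rng = np.random.default_rng(0)
d = 8
states = rng.standard_normal((4, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
ctx = depth_attention(states, w_q, w_k, w_v)
print(ctx.shape)  # (8,)
```

The point of the sketch is the shape of the information flow: the read-out at each depth is a weighted mixture of all per-depth states rather than the last state alone, which is one way a fixed hidden size can stop being the bottleneck for many-step latent reasoning.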
Problem

Research questions and friction points this paper is trying to address.

depth-recurrence
latent reasoning
hidden-size bottleneck
parameter sharing
memory efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

depth-recurrent attention
latent reasoning
sparse expert attention
parameter-efficient scaling
attention along depth
Jonas Knupp
Aleph Alpha Research, Heidelberg, Germany; Lab1141, Germany; Work started at Technical University of Munich, Germany
J. Metzen
Aleph Alpha Research, Heidelberg, Germany
Jeremias Bohn
Research Group Social Computing, Technical University of Munich, Germany
Georg Groh
Adjunct Professor
Social Computing, Natural Language Processing
Kristian Kersting
Professor of AI & ML, Technical University of Darmstadt, Hessian.ai, DFKI, CAIRNE/ELLIS, AAAI Fellow
Artificial Intelligence, Neurosymbolic AI, Probabilistic Circuits, Machine Learning