What Makes Looped Transformers Perform Better Than Non-Recursive Ones (Provably)

📅 2025-10-11

📈 Citations: 0

✨ Influential: 0

career value

172K/year

🤖 AI Summary

Why does Looped-Attention (Looped-Attn) outperform standard Transformers on complex reasoning tasks? This paper provides the first theoretical explanation from the perspective of loss landscape geometry: Looped-Attn’s recursive structure induces a sharper, V-shaped valley—rather than a shallow U-shaped one—in the loss landscape, enhancing modeling capacity for intricate patterns and accelerating convergence. To formalize this insight, we propose a novel loss-landscape taxonomy that bridges Hessian-based curvature analysis with sample-wise dynamic evolution, enabling the design of SHIFT—a staged, progressive training method. We theoretically prove that Looped-Attn possesses an intrinsic V-shaped valley inductive bias, guaranteeing superior optimization properties and convergence guarantees. Empirically, SHIFT significantly reduces training time while preserving model performance. This work establishes the first unified, loss-geometry–based theoretical foundation for recurrent attention architectures.

Technology Category

Application Category

📝 Abstract

While looped transformers (termed as Looped-Attn) often outperform standard transformers (termed as Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. Theoretical derivations based on this inductive bias guarantee a better loss convergence along the river due to valley hopping, and further encourage learning about complex patterns compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training process of Looped-Attn while achieving comparable performances.

Problem

Research questions and friction points this paper is trying to address.

Explaining looped transformers' superior reasoning performance theoretically

Analyzing loss landscape geometry differences between recursive and standard architectures

Proposing staged training framework to accelerate looped transformer optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Looped transformers induce River-V-Valley landscape bias

SHIFT framework enables staged progressive training

Valley hopping accelerates loss convergence in loops

🔎 Similar Papers

On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding