One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study investigates whether a shared-weight recurrent Transformer can develop complementary functional roles without being partitioned into separate modules. It introduces Asymmetric Input Recurrence (AIR), a two-state architecture in which the encoded input is injected during L-updates but not H-updates, and shows that a signal distinguishing the two update types, whether input-injection asymmetry or a separate level token, is enough to induce stable functional differentiation. Decoded rollouts, state-freezing experiments, and ablations on Sudoku-Extreme and Maze indicate that $z_H$ behaves like a committed proposal state while $z_L$ retains local uncertainty. Attention analysis further shows that L-updates are consistently more local than H-updates. Together, the results suggest that internal role specialization in a shared-parameter recurrent Transformer can arise from a simple state-identity signal.
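The summary does not say how attention locality is measured. One common proxy is the mean attention distance, sketched below under that assumption. The function name and the 1-D token indexing are illustrative choices, not the authors' metric; for grid tasks such as Sudoku or Maze, a 2-D grid distance would likely be more appropriate.

```python
import numpy as np

def mean_attention_distance(attn: np.ndarray) -> float:
    """Expected |query_pos - key_pos| under an attention matrix.

    attn has shape (num_queries, num_keys) and each row sums to 1.
    Smaller values mean attention mass sits closer to the query position.
    NOTE: hypothetical metric, not taken from the paper.
    """
    n_q, n_k = attn.shape
    q_idx = np.arange(n_q)[:, None]
    k_idx = np.arange(n_k)[None, :]
    dist = np.abs(q_idx - k_idx)
    # Weight each distance by its attention mass, then average over queries.
    return float((attn * dist).sum(axis=1).mean())

# Toy comparison: purely diagonal (local) vs. uniform (global) attention.
n = 8
local = np.eye(n)
uniform = np.full((n, n), 1.0 / n)

local_score = mean_attention_distance(local)
uniform_score = mean_attention_distance(uniform)
```

Under this proxy, an update type whose attention maps score lower would be described as more local.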

📝 Abstract

Can a shared-weight recurrent Transformer develop distinct internal roles without being partitioned into separate modules? We study this in Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture in which the same Transformer model is reused for both updates (per literature, L and H) and the only built-in difference in the update rule is that the encoded input is injected during L-updates but not H-updates. Across Sudoku-Extreme and Maze, decoded rollouts reveal a consistent split: $z_H$ behaves like a fully committed proposal state, whereas $z_L$ retains local uncertainty and shifting intermediate structure. Freeze experiments show that this split is, in practice, related to the model's state dynamics: in Sudoku, freezing $z_H$ reduces $z_L$'s content changes whereas freezing $z_L$ increases $z_H$'s, while in Maze, freezing either state increases content changes in the other state. Ablations show that to induce specialization, the shared model needs to be able to tell the two update types apart, either from input injection asymmetry or from a separate level token. Mechanistically, attention analysis shows that L-updates are consistently more local than H-updates in both Sudoku and Maze. Together, these results show that, in a two-state recurrent setting, a clear state-identity signal can induce stable, related functional roles inside a shared-parameter recurrent Transformer. Code is available at https://github.com/juchengshen/air.
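The two-state recurrence and the freeze experiments described above can be illustrated with a toy sketch. Everything below is an assumption for illustration: a single `tanh` layer stands in for the shared Transformer, states are combined by addition, and "content changes" is taken to mean the number of decoded tokens that differ between consecutive steps. The names `shared_step`, `decode`, and `rollout` are hypothetical and not from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, N = 16, 10, 12  # hidden size, vocab size, sequence length (toy values)

# One parameter set is reused for both update types (stand-in for the shared model).
W = rng.normal(scale=0.3, size=(D, D))
W_out = rng.normal(size=(D, V))

def shared_step(h):
    """Toy stand-in for the shared Transformer: same weights for every update."""
    return np.tanh(h @ W)

def decode(z):
    """Decode a state to token ids through a shared output head."""
    return (z @ W_out).argmax(axis=-1)

def rollout(x, steps=6, freeze=None):
    """Alternate L- and H-updates and count decoded token changes per state.

    freeze: None, "L", or "H"; the named state is held at its initial value.
    """
    init = np.random.default_rng(1).normal(size=(2, N, D))
    z_l, z_h = init[0], init[1]
    changes = {"L": 0, "H": 0}
    for _ in range(steps):
        prev_l, prev_h = decode(z_l), decode(z_h)
        if freeze != "L":
            z_l = shared_step(z_l + z_h + x)  # L-update: encoded input injected
        if freeze != "H":
            z_h = shared_step(z_h + z_l)      # H-update: no input injection
        changes["L"] += int((decode(z_l) != prev_l).sum())
        changes["H"] += int((decode(z_h) != prev_h).sum())
    return z_l, z_h, changes

x = rng.normal(size=(N, D))                   # toy encoded input
_, _, baseline = rollout(x)
_, _, frozen_h = rollout(x, freeze="H")
_, _, frozen_l = rollout(x, freeze="L")
```

Comparing `frozen_h["L"]` and `frozen_l["H"]` against `baseline` mirrors the shape of the freeze comparison in the abstract. With random toy weights the numbers carry no meaning; the sketch only shows how the asymmetry and the measurement fit together.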

Problem

Research questions and friction points this paper is trying to address.

Recurrent Transformer

Emergent Specialization

Shared-weight Model

Asymmetric Input Recurrence

Functional Roles

Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent Transformer

Emergent Specialization

Asymmetric Input Recurrence