Block-Recurrent Dynamics in Vision Transformers

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited interpretability of depth-wise computation in Vision Transformers (ViTs) by proposing and empirically validating the Block-Recurrent Hypothesis (BRH): post-training, ViTs admit a reparameterization in which the L-layer forward pass reduces to recurrent execution of only k ≪ L reusable blocks. Leveraging inter-layer representation similarity analysis, Raptor recurrent surrogate modeling, surrogate training against a pretrained DINOv2, and phase-space trajectory analysis, the authors provide the first empirical evidence that ViT depth dynamics exhibit a compact recurrent structure, recovering 96% of DINOv2's ImageNet-1K linear probe accuracy with merely two recurrent blocks. The study uncovers intrinsic dynamical patterns: class-dependent angular basin convergence, token-specific evolution, and late-stage low-rank updates. Collectively, these findings establish a novel "depth-as-dynamics" interpretability paradigm, furnishing both conceptual foundations and empirical support for ViT interpretability and efficient reparameterization.
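The core reparameterization claim can be illustrated with a toy sketch: instead of L distinct layer functions, depth is covered by a schedule over k ≪ L shared blocks applied recurrently. The residual-linear blocks and the contiguous phase schedule below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Toy sketch of the Block-Recurrent Hypothesis (BRH): an L-step forward
# pass rewritten as recurrent applications of k << L shared blocks.
rng = np.random.default_rng(0)
L, k, d = 12, 2, 8  # 12 depth steps, 2 shared blocks, width 8 (all illustrative)

# Each shared block is a stand-in for a full Transformer block:
# a residual linear map x -> (I + W_i) x.
blocks = [np.eye(d) + 0.01 * rng.standard_normal((d, d)) for _ in range(k)]

# A contiguous phase schedule: block 0 is reused for the first half of
# depth, block 1 for the second half.
schedule = [0] * (L // 2) + [1] * (L - L // 2)

def block_recurrent_forward(x, blocks, schedule):
    """Run the depth schedule, reusing each shared block recurrently."""
    for i in schedule:
        x = blocks[i] @ x
    return x

x = rng.standard_normal(d)
y = block_recurrent_forward(x, blocks, schedule)
```

The point of the sketch is only the bookkeeping: L applications, but just k distinct parameter sets, so compute per forward pass is unchanged while the parameter count along depth shrinks by a factor of roughly L/k.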

📝 Abstract
As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale experiments, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where the CLS token executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low-rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
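The "between-layer representational similarity matrices" mentioned in the abstract are commonly computed with linear CKA, where a block-diagonal pattern suggests contiguous phases of similar computation. The abstract does not name the exact metric, so the linear-CKA choice and the toy features below are assumptions for illustration.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (n_samples, dim) representation matrices.

    Returns a value in [0, 1]; 1 means the (centered) representations
    span the same similarity structure.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Toy stand-ins for per-layer feature matrices (64 samples, 16 dims each).
rng = np.random.default_rng(1)
feats = [rng.standard_normal((64, 16)) for _ in range(4)]

# The full layer-by-layer similarity matrix; contiguous blocks of high
# similarity along the diagonal would indicate depth phases.
sim = np.array([[linear_cka(a, b) for b in feats] for a in feats])
```

With real ViT activations, `feats[i]` would hold the flattened token representations after layer i, and the resulting matrix is what the phase analysis inspects.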
Problem

Research questions and friction points this paper is trying to address.

Depth-wise computation in ViTs lacks a mechanistic, dynamical-systems interpretation
It is unknown whether contiguous representational phases across layers reflect genuinely reusable computation
Convergence behavior and token-specific trajectories along depth have not been characterized
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block-Recurrent Hypothesis: trained ViT depth can be rewritten with k ≪ L shared blocks
Raptor surrogates approximate pretrained ViTs with few recurrent blocks at equivalent compute
Dynamical interpretability program reveals angular basins, token-specific dynamics, and low-rank late-depth updates
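The low-rank late-depth finding can be probed with a simple diagnostic: the stable rank of each layer's residual update, a smooth proxy for matrix rank that drops when updates concentrate in a few directions. The metric choice and the synthetic "early" vs. "late" updates below are assumptions for illustration, not the paper's measurement protocol.

```python
import numpy as np

def stable_rank(M):
    """Stable rank ||M||_F^2 / ||M||_2^2, a smooth proxy for matrix rank."""
    s = np.linalg.svd(M, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

rng = np.random.default_rng(2)

# Early-depth-style update: unstructured, effectively full-rank noise.
full = rng.standard_normal((32, 32))

# Late-depth-style update: dominant rank-2 structure plus small noise,
# mimicking a collapse onto a low-dimensional attractor.
low = (rng.standard_normal((32, 2)) @ rng.standard_normal((2, 32))
       + 0.01 * rng.standard_normal((32, 32)))
```

Applied to real residual-stream updates per layer, a downward trend of this quantity with depth would be consistent with the reported collapse to low-rank updates.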