bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

📅 2026-05-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work asks whether the depth of Vision Transformers (ViTs) requires independently parameterized layers, and studies the question with bViT, an architecture that applies a single Transformer block recurrently. bViT keeps the iterative structure of a deep ViT while sharply reducing the parameter count, and it also enables parameter-efficient fine-tuning. The results indicate that, given a sufficiently wide representation space, a large fraction of a ViT's depth can be realized through recurrence, a behavior the authors interpret as implicit depth multiplexing. On ImageNet-1K, a 12-step bViT-B reaches accuracy comparable to standard ViT-B under the same training recipe and computational budget while using an order of magnitude fewer parameters, and it transfers competitively to downstream tasks.
📝 Abstract
Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer-specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer-specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention, and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
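To make the parameter claim concrete, here is a minimal sketch of the bookkeeping behind single-block recurrence. It is not the authors' code. It assumes a standard pre-norm transformer block with biases and an MLP ratio of 4, and it assumes the recurrent model uses the same width as ViT-B (768), which the abstract does not state. The helper names (`block_params`, `stacked_params`, `recurrent_params`, `apply_recurrent`) are hypothetical, and patch embedding, position embeddings, and the classifier head are left out of the count.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def block_params(width: int, mlp_ratio: int = 4) -> int:
    """Parameter count of one standard transformer block (assumed layout)."""
    attn = 4 * width * width + 4 * width        # Q, K, V and output projections, with biases
    hidden = mlp_ratio * width
    mlp = 2 * width * hidden + hidden + width   # two linear layers, with biases
    norms = 2 * 2 * width                       # two LayerNorms, scale and shift each
    return attn + mlp + norms


def stacked_params(width: int, depth: int) -> int:
    """Standard ViT: every layer owns its own block parameters."""
    return depth * block_params(width)


def recurrent_params(width: int, steps: int) -> int:
    """Single-block recurrence: extra steps reuse the same parameters."""
    return block_params(width)


def apply_recurrent(block: Callable[[T], T], state: T, steps: int) -> T:
    """Apply one shared block `steps` times to an evolving hidden state."""
    for _ in range(steps):
        state = block(state)
    return state


if __name__ == "__main__":
    width, depth = 768, 12  # ViT-B-like settings, assumed for illustration
    ratio = stacked_params(width, depth) / recurrent_params(width, depth)
    print(f"block parameters, stacked vs. shared: {ratio:.1f}x")
```

Under these assumptions the block parameters shrink by a factor equal to the depth, which is consistent with the "order of magnitude" figure for 12 steps. The compute per forward pass is unchanged, because the shared block still runs once per step.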
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
recurrence
parameter efficiency
image recognition
depth
Innovation

Methods, ideas, or system contributions that make the work stand out.

recurrent Vision Transformer
parameter efficiency
implicit depth multiplexing
single-block architecture
representation width