🤖 AI Summary
To address the prohibitively high computational cost of deeply stacked Vision Transformers (ViTs), this paper proposes a depth-compression framework based on branch-level structural reparameterization. During training, it employs a parallel multi-branch architecture with progressive fusion at the inputs of both feed-forward network (FFN) and multi-head self-attention (MHSA) modules; crucially, because branches are fused before the nonlinearities, the reparameterization is mathematically exact. At inference, all branches are merged into a single path with no accuracy degradation. This work is the first to introduce structural reparameterization into ViT backbones, challenging the "deeper-is-better" paradigm. On ViT-Tiny, it reduces depth from 12 to 3–6 layers without loss of ImageNet-1K classification accuracy, and achieves up to a 37% speedup on mobile CPU inference—significant efficiency gains without compromising model fidelity.
📝 Abstract
The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.
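The exactness of the consolidation rests on a simple identity: parallel branches whose outputs are summed *before* any nonlinearity are algebraically equivalent to a single layer with summed parameters. Below is a minimal sketch of that linear merge step using NumPy; the variable names are illustrative and not the authors' implementation, which additionally handles FFN and MHSA entry points.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 8, 8, 4
x = rng.standard_normal((batch, d_in))

# Two parallel linear branches applied to the same input, summed before
# any nonlinearity: y = (x @ W1.T + b1) + (x @ W2.T + b2)
W1, b1 = rng.standard_normal((d_out, d_in)), rng.standard_normal(d_out)
W2, b2 = rng.standard_normal((d_out, d_in)), rng.standard_normal(d_out)
y_branches = (x @ W1.T + b1) + (x @ W2.T + b2)

# Inference-time merge: one path whose weight and bias are the sums of
# the branch parameters. No approximation error is introduced.
W_merged, b_merged = W1 + W2, b1 + b2
y_merged = x @ W_merged.T + b_merged

assert np.allclose(y_branches, y_merged)
```

Because the merge is exact only up to the point where a nonlinearity is applied, fusing at the *inputs* of the FFN and MHSA blocks (as the abstract describes) is what lets the multi-branch training graph collapse to a single-path model at test time.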