🤖 AI Summary
To address the prohibitively high computational cost of deeply stacked Vision Transformers (ViTs), this paper proposes a depth-compression framework based on branch-level structural reparameterization. During training, it employs a parallel multi-branch architecture with progressive fusion at the inputs of both feed-forward network (FFN) and multi-head self-attention (MHSA) modules; crucially, because branches are fused before the nonlinearities, the reparameterization is mathematically exact. At inference, all branches are merged into a single path with no accuracy degradation. This work is the first to introduce structural reparameterization into ViT backbones, challenging the "deeper-is-better" paradigm. On ViT-Tiny, it reduces depth from 12 to 3–6 layers without loss of ImageNet-1K classification accuracy, and achieves up to a 37% speedup on mobile CPU inference—significant efficiency gains without compromising model fidelity.
📝 Abstract
The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.
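The exactness of the consolidation rests on a simple identity: parallel branches whose outputs are summed *before* any nonlinearity are algebraically equivalent to a single layer with summed parameters. Below is a minimal sketch of that linear merge step using NumPy; the variable names are illustrative and not the authors' implementation, which additionally handles FFN and MHSA entry points.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 8, 8, 4
x = rng.standard_normal((batch, d_in))

# Two parallel linear branches applied to the same input, summed before
# any nonlinearity: y = (x @ W1.T + b1) + (x @ W2.T + b2)
W1, b1 = rng.standard_normal((d_out, d_in)), rng.standard_normal(d_out)
W2, b2 = rng.standard_normal((d_out, d_in)), rng.standard_normal(d_out)
y_branches = (x @ W1.T + b1) + (x @ W2.T + b2)

# Inference-time merge: one path whose weight and bias are the sums of
# the branch parameters. No approximation error is introduced.
W_merged, b_merged = W1 + W2, b1 + b2
y_merged = x @ W_merged.T + b_merged

assert np.allclose(y_branches, y_merged)
```

Because the merge is exact only up to the point where a nonlinearity is applied, fusing at the *inputs* of the FFN and MHSA blocks (as the abstract describes) is what lets the multi-branch training graph collapse to a single-path model at test time.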