ViT-5: Vision Transformers for The Mid-2020s

📅 2026-02-08
🤖 AI Summary
This work proposes ViT-5, a systematically optimized Vision Transformer backbone that enhances the native ViT architecture through advanced normalization strategies, a novel activation function, improved positional encoding, gating mechanisms, and learnable tokens—all while preserving the standard attention–feedforward network structure. Designed as a plug-and-play upgrade, ViT-5 aligns with contemporary foundation model practices and achieves a top-1 accuracy of 84.2% on ImageNet-1k, outperforming DeiT-III-Base. Furthermore, when integrated into a SiT diffusion model, it reduces the FID score to 1.84, demonstrating substantial improvements in visual understanding, generation quality, and cross-task transferability.

📝 Abstract
This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
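The abstract names the component families that are modernized (normalization, activation, gating) but not the specific choices, so the sketch below is illustrative rather than the paper's verified design. It assumes two common mid-2020s substitutions inside an otherwise-standard block: RMSNorm in place of LayerNorm, and a SwiGLU-style gated feed-forward network as the gating mechanism, shown in plain NumPy:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale each token vector by its root-mean-square
    # (a common modern replacement for LayerNorm; no mean-centering).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward block: a SiLU-activated gate elementwise
    # modulates a linear "up" projection before projecting back down.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ w_up)) @ w_down

# Toy example (all sizes hypothetical): 4 tokens, 8-dim embeddings,
# 16-dim FFN hidden width, with a pre-norm residual connection.
rng = np.random.default_rng(0)
d, hidden = 8, 16
tokens = rng.normal(size=(4, d))
normed = rms_norm(tokens, weight=np.ones(d))
out = tokens + swiglu_ffn(normed,
                          rng.normal(size=(d, hidden)),
                          rng.normal(size=(d, hidden)),
                          rng.normal(size=(hidden, d)))
print(out.shape)  # (4, 8)
```

Because both refinements keep the tensor shapes of the standard Attention-FFN block, they slot into an existing ViT without changing the surrounding architecture, which is what makes this style of upgrade "plug-and-play".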
Problem

Research questions and friction points this paper addresses.

Vision Transformers
backbone modernization
visual understanding
generative modeling
representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer
architectural refinement
positional encoding
gating mechanism
foundation model