🤖 AI Summary
This work proposes ViT-5, a systematically optimized Vision Transformer backbone that enhances the vanilla ViT architecture through updated normalization strategies, modern activation functions, improved positional encoding, gating mechanisms, and learnable tokens—all while preserving the standard attention–feedforward network structure. Designed as a plug-and-play upgrade, ViT-5 aligns with contemporary foundation-model practices and achieves 84.2% top-1 accuracy on ImageNet-1k, outperforming DeiT-III-Base. Furthermore, when integrated into a SiT diffusion model, it reduces FID from 2.06 to 1.84, demonstrating improvements in visual understanding, generation quality, and cross-task transferability.
📝 Abstract
This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into a SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
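The abstract names the refined components (normalization, activation, gating) without specifying the exact choices. As a hedged illustration only, the sketch below shows two refinements commonly adopted in modern transformer backbones and plausibly of the kind the paper means: RMSNorm in place of LayerNorm, and a SwiGLU-style gated FFN in place of the plain MLP. Both function names and hyperparameters here are assumptions for illustration, not the paper's confirmed design.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of features.

    Unlike LayerNorm, it skips mean-centering and the bias term,
    which is a common simplification in recent transformer stacks.
    """
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN (SwiGLU-style): SiLU(x @ w_gate) elementwise-gates
    the up-projection (x @ w_up) before projecting back down."""
    gate = x @ w_gate
    silu = gate * (1.0 / (1.0 + np.exp(-gate)))  # SiLU / swish activation
    return (silu * (x @ w_up)) @ w_down

# Toy forward pass on random token embeddings (shapes only, no training).
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
tokens = rng.standard_normal((2, d_model))          # 2 tokens, width 8
normed = rms_norm(tokens, np.ones(d_model))
out = swiglu_ffn(normed,
                 rng.standard_normal((d_model, d_hidden)),
                 rng.standard_normal((d_model, d_hidden)),
                 rng.standard_normal((d_hidden, d_model)))
print(normed.shape, out.shape)  # (2, 8) (2, 8)
```

Either component can be dropped into a standard pre-norm Attention-FFN block without changing the block's interface, which is what makes this style of refinement a "plug-and-play" upgrade.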