🤖 AI Summary
Existing equivariant Vision Transformers struggle to simultaneously achieve high performance and strict group equivariance, often because their components are not made equivariant in a consistent way. This work proposes the first unified equivariant ViT framework, systematically designing the core modules, including patch embedding, self-attention, positional encoding, and down/up-sampling, to be jointly equivariant under a specified group. The resulting architecture provides rigorous theoretical guarantees of equivariance while remaining plug-and-play compatible with existing models. Notably, the framework extends naturally to variants such as Swin Transformer, consistently delivering significant improvements in both performance and data efficiency across diverse vision tasks, demonstrating its effectiveness and broad applicability.
📝 Abstract
Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the self-attention mechanism with patch embedding. To address this, we propose a straightforward framework that systematically renders the key ViT components, including patch embedding, self-attention, positional encodings, and down/up-sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, extending seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
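The guarantee the abstract describes is the standard equivariance property: for a group element g (e.g. a 90° rotation) and a layer f, applying g before the layer must equal applying it after, f(g·x) = g·f(x). The abstract does not specify the paper's construction, so as a hedged, generic illustration only (not the authors' method), the sketch below builds a toy C4-equivariant layer by symmetrizing a kernel over the four 90° rotations and numerically checks the property; all function names here are our own:

```python
import numpy as np

def c4_symmetrize(kernel):
    # Average a kernel over its four 90-degree rotations,
    # making it invariant under the cyclic group C4.
    return sum(np.rot90(kernel, k) for k in range(4)) / 4.0

def correlate2d_valid(x, k):
    # Plain 'valid' 2-D cross-correlation, no padding.
    kh, kw = k.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # toy "image"
k = c4_symmetrize(rng.standard_normal((3, 3)))

# Equivariance check: rotate-then-apply equals apply-then-rotate.
lhs = correlate2d_valid(np.rot90(x), k)  # f(g . x)
rhs = np.rot90(correlate2d_valid(x, k))  # g . f(x)
assert np.allclose(lhs, rhs)
```

The same check (with the appropriate group action on patches, attention maps, and positional encodings) is what a jointly equivariant ViT must pass at every module, which is why harmonizing all components under one group is the crux of the design.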