🤖 AI Summary
This work addresses the lack of strict equivariance under the special Euclidean group SE(d) in standard Vision Transformers. Methodologically: (1) steerable convolutions are employed to extract SE(d)-equivariant features; (2) a nonlinear attention mechanism is formulated in the Fourier domain—bypassing spatial interpolation artifacts—to ensure exact translational and rotational equivariance; (3) frequency-domain nonlinear activations and SE(d)-equivariant feature encoding are introduced. Evaluated on 2D/3D geometric perception benchmarks, the model substantially outperforms purely steerable CNNs, demonstrating the efficacy of equivariant attention for robust geometric modeling. The core contribution is the first integration of strict SE(d) equivariance into a Transformer backbone, establishing a novel paradigm of Fourier-domain equivariant attention.
📝 Abstract
In this work we introduce Steerable Transformers, an extension of the Vision Transformer mechanism that maintains equivariance to the special Euclidean group $\mathrm{SE}(d)$. We propose an equivariant attention mechanism that operates on features extracted by steerable convolutions. Operating in Fourier space, our network employs Fourier-domain non-linearities. Our experiments in both two and three dimensions show that adding steerable transformer layers to steerable convolutional networks enhances performance.
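The core equivariance argument can be illustrated numerically. In the sketch below (a toy construction, not the paper's implementation), each token carries complex Fourier coefficients over $\mathrm{SO}(2)$ frequencies; a rotation by $\theta$ acts on the frequency-$m$ coefficient as multiplication by $e^{-im\theta}$. Attention scores built from conjugate-linear inner products are then rotation-invariant, so the attention output rotates exactly with the input. All names (`rotate`, `fourier_attention`) and the toy dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): N tokens, each holding complex Fourier
# coefficients for SO(2) frequencies m = 0..M-1.
N, M = 4, 3
q = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
k = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
v = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))

def rotate(f, theta):
    # A rotation by theta multiplies the frequency-m coefficient
    # by exp(-i * m * theta).
    m = np.arange(f.shape[-1])
    return f * np.exp(-1j * m * theta)

def fourier_attention(q, k, v):
    # Conjugate-linear inner products cancel the phases exp(-i m theta)
    # of query and key, so the scores are rotation-invariant ...
    scores = (q @ k.conj().T).real
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # ... and the weighted sum of values inherits their equivariance.
    return w @ v

theta = 0.7
out = fourier_attention(q, k, v)
out_rot = fourier_attention(rotate(q, theta), rotate(k, theta), rotate(v, theta))
# Rotating all inputs rotates the output: equivariance holds numerically.
assert np.allclose(out_rot, rotate(out, theta))
```

This only checks rotational equivariance in a single-head, single-channel toy; the paper's mechanism additionally handles translations and $d$-dimensional rotations, and composes with steerable convolutional feature extraction.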