Alias-Free ViT: Fractional Shift Invariance via Linear Attention

📅 2025-10-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) lack the inductive bias of convolutional layers and are therefore not robust to either integer or fractional image translations; conventional CNNs, meanwhile, are not strictly shift-invariant either, because downsampling and nonlinear activations introduce aliasing. This work proposes a linear cross-covariance attention mechanism that, for the first time, makes ViTs equivariant to continuous (fractional) translations. Combined with anti-aliased downsampling and aliasing-resistant nonlinear activations, the approach removes every major source of aliasing in the architecture. The resulting model maintains competitive accuracy on ImageNet and other image classification benchmarks while significantly outperforming comparably sized ViT and CNN baselines under adversarial translations, empirically supporting the robustness benefits of continuously translation-equivariant representations.
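
The equivariance argument rests on the fact that cross-covariance (channel) attention pools its statistics over all spatial tokens, so the channel-mixing matrix does not depend on where content sits on the grid. Below is a minimal PyTorch sketch of a softmax-free cross-covariance attention in that spirit; it assumes an XCiT-style layout, and the class name, normalization, and head handling are illustrative rather than the paper's exact layer.

```python
import torch
import torch.nn as nn

class LinearXCAttention(nn.Module):
    """Illustrative softmax-free cross-covariance (channel) attention.

    The (head_dim x head_dim) mixing matrix is pooled over all spatial
    tokens, so translating the input grid leaves it unchanged; applying
    it pointwise to each token is then shift-equivariant.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                      # (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split heads -> (batch, heads, tokens, head_dim)
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2)
                   for t in (q, k, v))
        # channel-channel statistics pooled over tokens: (b, h, hd, hd)
        mix = (k.transpose(-2, -1) @ v) / n
        out = q @ mix                          # per-token channel mixing
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```

Because `mix` is pooled over every token, shifting the input (including fractionally, for a properly band-limited feature map) shifts the output identically, which is the equivariance property the summary refers to.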

๐Ÿ“ Abstract
Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. Specifically, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift-invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify convnets' translation robustness. Building on this line of work, we propose an Alias-Free ViT, which combines two main components. First, it uses alias-free downsampling and nonlinearities. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global representation. Our model maintains competitive performance in image classification and outperforms similar-sized models in terms of robustness to adversarial translations.
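
For context on the anti-aliasing components the abstract mentions: the standard remedy from the convnet literature is to low-pass filter feature maps before subsampling, so high frequencies cannot fold back as aliases. A minimal sketch in that spirit, using a fixed binomial blur as in blur-pooling; the kernel size and padding choice are assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

def blur_downsample(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Depthwise low-pass (binomial) filter applied before subsampling."""
    c = x.shape[1]
    # separable [1, 2, 1] binomial kernel, normalized to sum to 1
    k1d = torch.tensor([1.0, 2.0, 1.0], dtype=x.dtype, device=x.device)
    k2d = torch.outer(k1d, k1d)
    k2d = (k2d / k2d.sum()).view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    x = F.pad(x, (1, 1, 1, 1), mode="reflect")  # keep borders sensible
    return F.conv2d(x, k2d, stride=stride, groups=c)
```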
Problem

Research questions and friction points this paper is trying to address.

Addressing Vision Transformers' sensitivity to image translations
Developing shift-equivariant attention for fractional translations
Improving robustness against adversarial translations in ViTs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alias-free downsampling and nonlinearities for shift invariance (see the sketch after this list)
Linear cross-covariance attention enabling fractional translation equivariance
Combining alias-free components with linear attention transformers
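
For the nonlinearity half of the first bullet above, one established recipe (from alias-free generative models) is to evaluate the activation on an oversampled grid and then filter back down, since pointwise nonlinearities generate new high frequencies. The sketch below is a hypothetical stand-in for the paper's aliasing-resistant activation; bilinear resampling and average pooling are crude substitutes for proper low-pass filters:

```python
import torch
import torch.nn.functional as F

def alias_free_gelu(x: torch.Tensor, up: int = 2) -> torch.Tensor:
    """Apply GELU at a higher sampling rate, then filter back down.

    Evaluating the nonlinearity on an upsampled grid keeps the new
    harmonics it creates below the Nyquist limit of the original
    resolution before we return to it.
    """
    x = F.interpolate(x, scale_factor=up, mode="bilinear",
                      align_corners=False)    # oversample
    x = F.gelu(x)                             # nonlinearity at high rate
    return F.avg_pool2d(x, kernel_size=up)    # low-pass and subsample
```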