🤖 AI Summary
Real-world data often fail to satisfy strict equivariance assumptions, which limits model performance. To address this limitation, the work proposes a general framework that achieves controllable soft equivariance by projecting pretrained model weights onto a designated subspace, applicable to arbitrary backbone architectures such as Vision Transformers (ViT) and ResNets. The method provides theoretical guarantees in the form of bounded equivariance errors and demonstrates significant reductions in such errors on benchmarks like ImageNet. Furthermore, it consistently improves performance across diverse tasks—including image classification, semantic segmentation, and trajectory prediction—without requiring architectural modifications or retraining from scratch.
📝 Abstract
Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied by real-world data, which can limit a model's performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights onto a designated subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.
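To make the weight-projection idea concrete, here is a minimal NumPy sketch for a single convolutional filter under horizontal-flip symmetry. This is an illustrative assumption, not the paper's actual construction: the subspace used (the flip-symmetric filters), the averaging projection, and the interpolation coefficient `alpha` controlling the degree of equivariance are all hypothetical simplifications of "projecting pretrained weights onto a designated subspace".

```python
import numpy as np

def flip_h(w):
    # Horizontally flip a conv weight tensor of shape (out_c, in_c, kH, kW).
    return w[..., ::-1]

def soft_equivariant_project(w, alpha):
    """Interpolate pretrained weights toward their flip-symmetric projection.

    alpha=0 leaves the weights untouched; alpha=1 projects fully onto the
    flip-invariant subspace (hypothetical illustration of soft equivariance).
    """
    w_sym = 0.5 * (w + flip_h(w))  # orthogonal projection onto symmetric filters
    return (1.0 - alpha) * w + alpha * w_sym

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3, 3, 3))  # pretrained filter bank (toy example)

# Equivariance error here: distance of the weights from the symmetric subspace.
err = lambda x: np.linalg.norm(x - flip_h(x))

w_hard = soft_equivariant_project(w, alpha=1.0)
assert np.allclose(w_hard, flip_h(w_hard))  # fully projected => exactly symmetric

w_soft = soft_equivariant_project(w, alpha=0.5)
assert err(w_soft) < err(w)  # partial projection shrinks the error
```

Because the projection only rewrites existing weights, a sketch like this could in principle be applied to any pretrained backbone without architectural changes, which is the property the abstract emphasizes.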