🤖 AI Summary
This work addresses the tension in self-supervised learning between enforcing strong invariance—often detrimental to geometric robustness and spatially sensitive tasks—and preserving equivariant structure. Existing approaches typically couple invariance and equivariance objectives in the same final representation, leading to suboptimal trade-offs. To resolve this, we propose Soft Equivariance Regularization (SER), which, for the first time, decouples these goals via a layer-wise design: analytical group actions ρ_g are applied directly to intermediate feature maps, imposing equivariance constraints independently of the main self-supervised objective. SER requires no additional prediction heads or transformation labels and incurs only a 1.008× increase in training FLOPs. It consistently improves performance across frameworks: MoCo-v3 gains +0.84 Top-1 on ImageNet-1k linear evaluation, while DINO and Barlow Twins also benefit; robustness improves by +1.11/+1.22 Top-1 on ImageNet-C/P, and COCO detection improves by +1.7 mAP.
📝 Abstract
Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and for spatially sensitive transfer. A growing body of work therefore augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: the base SSL objective is kept unchanged on the final embedding, while equivariance is softly encouraged on an intermediate spatial token map via analytically specified group actions $\rho_g$ applied directly in feature space. SER neither learns nor predicts per-sample transformation codes or labels, requires no auxiliary transformation-prediction head, and adds only 1.008× training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by +0.84 Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by +1.11/+1.22 Top-1 and frozen-backbone COCO detection by +1.7 mAP. Finally, applying the same layer-decoupling recipe to existing invariance+equivariance baselines improves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance.
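To make the core mechanism concrete, here is a minimal numpy sketch of a soft equivariance penalty of the form described above: an analytically specified group action $\rho_g$ (a 90° rotation here) is applied to an intermediate feature map, and the regularizer penalizes the discrepancy between $\rho_g(F(x))$ and $F(g \cdot x)$. The function names (`rho_g`, `ser_loss`), the weight `lam`, and the toy "layers" are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def rho_g(feat):
    """Analytical group action rho_g: rotate a (C, H, W) feature map by 90 deg."""
    return np.rot90(feat, k=1, axes=(1, 2))

def ser_loss(feat_of_transformed, feat, lam=0.1):
    """Soft equivariance penalty: lam * mean ||rho_g(F(x)) - F(g.x)||^2.

    `feat` is the intermediate map F(x); `feat_of_transformed` is F(g.x).
    """
    diff = rho_g(feat) - feat_of_transformed
    return lam * np.mean(diff ** 2)

np.random.seed(0)
x = np.random.randn(3, 8, 8)  # toy intermediate spatial token map

# An elementwise layer (ReLU) commutes with rotation, so the penalty is zero.
relu = lambda z: np.maximum(z, 0.0)
loss_equiv = ser_loss(relu(rho_g(x)), relu(x))

# A layer that mixes in a fixed spatial location breaks equivariance,
# so the penalty becomes positive and would push the layer back toward it.
shift = lambda z: z + z[:, :1, :1]
loss_nonequiv = ser_loss(shift(rho_g(x)), shift(x))
```

Because the penalty is *soft* (a weighted regularizer rather than a hard architectural constraint), the intermediate layer is only nudged toward equivariance while the final embedding remains free to satisfy the unchanged base SSL objective.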