🤖 AI Summary
This work addresses the training instability commonly induced by multi-crop strategies in predictor-based self-supervised learning. To mitigate this issue, the authors propose a multi-task asymmetric Siamese network that assigns a dedicated predictor to each of the global, local, and masked crop views, treating each spatial transformation as an independent alignment task. The framework further incorporates Cutout-based masking augmentation to enhance representation consistency. Compatible with both ResNet and Vision Transformer (ViT) backbones, the proposed method substantially improves representation learning on ImageNet while maintaining strong training stability and broad applicability across architectures.
📝 Abstract
Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which adds small local crops alongside global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, in which part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
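To make the core idea concrete, the per-view-predictor formulation can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the encoder, the linear predictors, the negative-cosine alignment loss, and all function names are hypothetical stand-ins for the actual backbone and predictor heads; the only point it demonstrates is that each view type ("global", "local", "masked") gets its own predictor, and each contributes an independent alignment term.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding dimension (illustrative only)

def encode(x):
    # Stand-in for a shared backbone encoder (hypothetical).
    return np.tanh(x)

# One predictor (here just a linear map) per view type, instead of a
# single predictor shared across all views.
predictors = {
    view: rng.normal(scale=0.1, size=(DIM, DIM))
    for view in ("global", "local", "masked")
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def alignment_loss(view_type, view_x, target_x):
    """Negative cosine similarity between the predicted embedding of a
    view and the embedding of the target view (no predictor on the
    target branch, mirroring the asymmetric Siamese setup)."""
    p = predictors[view_type] @ encode(view_x)  # view-specific predictor
    t = encode(target_x)                        # target branch
    return -cosine(p, t)

# One alignment task per view type, summed into a multi-task objective.
target = rng.normal(size=DIM)
total = sum(
    alignment_loss(v, target + rng.normal(scale=0.1, size=DIM), target)
    for v in predictors
)
print(len(predictors), isinstance(total, float))
```

In a real training loop the target branch would be a stop-gradient or momentum encoder and the predictors would be small MLPs, but the structural point carries over: decoupling the predictors lets each spatial transformation be optimized as its own alignment task.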