🤖 AI Summary
Self-supervised image representation learning is often not robust enough to noise, adversarial perturbations, and changes in crop size. Method: This paper proposes CO-SSL, a family of instance discrimination methods for self-supervised learning (SSL) that model the spatial co-occurrence of visual features by aligning local representations, taken *before* pooling, with a global image representation. Rather than masking portions of an image or cropping it aggressively, as recent SSL methods do, CO-SSL uses this local-global alignment to model co-occurrences. Results: With 100 pre-training epochs on ImageNet-1K, CO-SSL reaches 71.5% top-1 accuracy, and it outperforms previous methods on several datasets. It is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. The analysis indicates that CO-SSL learns highly redundant local representations, which offers an explanation for this robustness and suggests that local-global alignment may be a powerful principle of unsupervised category learning.
📝 Abstract
Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods, and show that it outperforms previous methods on several datasets, including ImageNet-1K, where it achieves 71.5% Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.
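To make the idea of aligning pre-pooling local representations with a global image representation more concrete, here is a minimal sketch in plain Python. It is not the CO-SSL objective itself, which the abstract does not specify. It assumes an InfoNCE-style instance discrimination loss, global average pooling as the pooling step, and two augmented views per image. The function names (`local_global_infonce`, `global_average_pool`) and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_average_pool(feature_map):
    """Average P local vectors of size D into one global vector of size D."""
    num_positions = len(feature_map)
    return [sum(column) / num_positions for column in zip(*feature_map)]


def local_global_infonce(local_view_a, local_view_b, temperature=0.1):
    """Assumed InfoNCE-style local-global alignment loss (illustrative only).

    local_view_a, local_view_b: nested lists of shape [B][P][D] holding the
    pre-pooling features of two augmented views of the same B images.
    Each local vector of image i in view A is pulled toward the pooled
    (global) vector of image i in view B and pushed away from the global
    vectors of the other images in the batch.
    """
    globals_b = [l2_normalize(global_average_pool(fm)) for fm in local_view_b]
    total, count = 0.0, 0
    for i, feature_map in enumerate(local_view_a):
        for local in feature_map:
            z = l2_normalize(local)
            logits = [dot(z, g) / temperature for g in globals_b]
            # Numerically stable log-sum-exp over all global candidates.
            top = max(logits)
            log_denominator = top + math.log(
                sum(math.exp(value - top) for value in logits)
            )
            total += log_denominator - logits[i]  # cross-entropy, target = i
            count += 1
    return total / count


# Toy batch: 2 images, 3 spatial positions each, 2-dimensional features.
view_a = [[[1.0, 0.0]] * 3, [[0.0, 1.0]] * 3]
view_b = [[[1.0, 0.0]] * 3, [[0.0, 1.0]] * 3]
aligned_loss = local_global_infonce(view_a, view_b)
# Swapping the images in view B breaks the local-global correspondence.
mismatched_loss = local_global_infonce(view_a, view_b[::-1])
```

In this toy batch, `aligned_loss` is lower than `mismatched_loss`, because every local vector already points toward the global vector of its own image. Because the loss is averaged over every spatial position, each local vector is pushed to identify its image on its own. That fits the abstract's observation that the learned local representations are highly redundant, although the sketch does not demonstrate it.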