🤖 AI Summary
Existing prototype-based self-supervised learning methods rely on a single prototype to represent all features within a cluster, which fails to capture the semantic diversity of the data. This work proposes Self-Organizing Visual Prototypes (SOP), which replaces the single fixed prototype with multiple semantically similar support embeddings (SEs) that are organized dynamically and jointly model local feature structure. Methodologically, it (i) introduces a multi-prototype collaborative representation mechanism; (ii) designs a non-parametric masked image modeling task, SOP-MIM; and (iii) integrates non-parametric contrastive learning, a reconstruction loss, and dynamic SE organization for non-parametric feature-space modeling. SOP is evaluated on image retrieval, linear evaluation, fine-tuning, and object detection; it achieves state-of-the-art performance on many retrieval benchmarks, with gains that increase for more complex encoders.
📝 Abstract
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.
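The idea of representing a prototype by several support embeddings can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that SEs are chosen as the k nearest neighbors (by cosine similarity) of an anchor embedding in a memory bank of stored embeddings, and that a view is compared against those SEs through a temperature-scaled softmax. The function names, the memory bank, `k`, and `temperature` are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def select_support_embeddings(anchor, bank, k):
    """Return indices of the k bank rows most cosine-similar to the anchor.

    Assumption: nearest-neighbor lookup stands in for the way SEs are
    organized around an anchor; the abstract does not specify this step.
    """
    sims = l2_normalize(bank) @ l2_normalize(anchor)
    # argsort is ascending, so take the last k entries and reverse them.
    return np.argsort(sims)[-k:][::-1]


def se_distribution(query, support, temperature=0.1):
    """Softmax over cosine similarities between a query and its SEs."""
    logits = l2_normalize(support) @ l2_normalize(query) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Toy data: random vectors stand in for stored encoder outputs.
rng = np.random.default_rng(0)
bank = rng.normal(size=(256, 32))

# The anchor is a slightly perturbed copy of bank row 7.
anchor = bank[7] + 0.01 * rng.normal(size=32)

idx = select_support_embeddings(anchor, bank, k=5)
support = bank[idx]
probs = se_distribution(anchor, support)
```

Here `support` holds five stored embeddings rather than one learned prototype vector, and `probs` describes how the anchor relates to each of them. In a training loop, a distribution of this kind computed for one view could serve as the target for another view or for a masked token, which is roughly the role the abstract assigns to SEs in its two non-parametric losses.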