🤖 AI Summary
Self-supervised learning (SSL) representations are often constrained by the inductive bias of a single backbone architecture. Method: We propose Heterogeneous Self-Supervised Learning (HSSL), a framework that introduces lightweight, structurally heterogeneous auxiliary heads—e.g., Transformer- and CNN-based—alongside the fixed backbone, enabling collaborative representation optimization without modifying the main network. Contribution/Results: We systematically demonstrate, for the first time, a positive correlation between architectural heterogeneity and representation quality. Leveraging this insight, we design a disparity-driven representation distillation mechanism and an efficient auxiliary-head search strategy. Extensive experiments across image classification, semantic/instance segmentation, and object detection show that HSSL consistently outperforms leading SSL methods—including MoCo and SimCLR—while maintaining compatibility with diverse contrastive learning baselines. This validates both the effectiveness and generalizability of architectural complementarity in self-supervised representation learning.
📝 Abstract
Incorporating heterogeneous representations from different architectures has facilitated various vision tasks; e.g., some hybrid networks combine transformers and convolutions. However, the complementarity between such heterogeneous architectures has not been well exploited in self-supervised learning. Thus, we propose Heterogeneous Self-Supervised Learning (HSSL), which enforces a base model to learn from an auxiliary head whose architecture differs from that of the base model. In this process, HSSL endows the base model with new characteristics through representation learning, without any structural changes. To comprehensively understand HSSL, we conduct experiments on various heterogeneous pairs, each containing a base model and an auxiliary head. We discover that the representation quality of the base model improves as the architectural discrepancy between the two grows. This observation motivates us to propose a search strategy that quickly determines the most suitable auxiliary head for a specific base model to learn from, as well as several simple but effective methods to enlarge the model discrepancy. HSSL is compatible with various self-supervised methods and achieves superior performance on various downstream tasks, including image classification, semantic segmentation, instance segmentation, and object detection.
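The abstract describes the core mechanism at a high level: the base model is pulled toward the representations of a structurally heterogeneous auxiliary head. The paper's exact distillation objective is not given here, so the sketch below illustrates one plausible choice (a cosine-alignment loss, an assumption on our part) between base and auxiliary features; `hetero_distill_loss` and the toy encoders are hypothetical names, not the authors' API.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Normalize rows to unit length (eps guards against zero vectors)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def hetero_distill_loss(z_base, z_aux):
    """Cosine-alignment loss pulling base-model features toward the
    heterogeneous auxiliary head's features. In an actual training
    loop a stop-gradient would typically be applied to z_aux.
    Returns a scalar in [0, 2]: 0 = perfectly aligned, 2 = opposed."""
    zb = l2_normalize(z_base)
    za = l2_normalize(z_aux)
    return float(np.mean(1.0 - np.sum(zb * za, axis=-1)))

# Toy example: stand-ins for base (e.g., CNN) and auxiliary
# (e.g., Transformer) features for a batch of 4 samples, 8 dims.
rng = np.random.default_rng(0)
z_base = rng.normal(size=(4, 8))
z_aux = z_base + 0.1 * rng.normal(size=(4, 8))  # nearly aligned pair
print(hetero_distill_loss(z_base, z_aux))   # small value near 0
print(hetero_distill_loss(z_base, -z_base))  # fully opposed features
```

The loss is minimized when the base model's (normalized) features match the auxiliary head's, which is how the auxiliary architecture's inductive bias can shape the base representation without changing the base network's structure.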