🤖 AI Summary
In the conventional self-supervised learning (SSL) pipeline, a misalignment between the pretext pre-training and downstream fine-tuning objectives hinders downstream adaptation. This work proposes BiSSL, a training framework that incorporates bilevel optimization into the SSL pipeline: it introduces an intermediate training stage between pre-training and fine-tuning that casts the pretext objective as the lower-level problem and the downstream objective as the upper-level problem, thereby modeling their interdependence and enabling enhanced information sharing between the two stages. The accompanying alternating training algorithm is applicable to a broad range of pretext tasks (e.g., SimCLR, BYOL) and downstream tasks (e.g., classification, detection). Pre-training ResNet-50 backbones on ImageNet, BiSSL significantly outperforms the conventional SSL pipeline on the vast majority of 12 downstream image classification datasets, as well as on object detection. Feature visualizations further show that the backbone acquires improved downstream alignment *before* fine-tuning. The core contribution is an explicit joint optimization stage linking pre-training and fine-tuning, which yields a backbone initialization better aligned with the downstream task.
📝 Abstract
This study presents BiSSL, a novel training framework that utilizes bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning. BiSSL formulates the pretext and downstream task objectives as the lower- and upper-level objectives in a bilevel optimization problem and serves as an intermediate training stage within the self-supervised learning pipeline. By explicitly modeling the interdependence of these training stages, BiSSL facilitates enhanced information sharing between them, ultimately leading to a backbone parameter initialization that is better aligned for the downstream task. We propose a versatile training algorithm that alternates between optimizing the two objectives defined in BiSSL, which is applicable to a broad range of pretext and downstream tasks. Using SimCLR and Bootstrap Your Own Latent to pre-train ResNet-50 backbones on the ImageNet dataset, we demonstrate that our proposed framework significantly outperforms the conventional self-supervised learning pipeline on the vast majority of 12 downstream image classification datasets, as well as on object detection. Visualizations of the backbone features provide further evidence that BiSSL improves the downstream task alignment of the backbone features prior to fine-tuning.
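The alternating scheme described above can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: the quadratic losses, parameter values, and step counts below are all hypothetical stand-ins for a real pretext objective (e.g., SimCLR) as the lower level and a downstream objective (e.g., classification) as the upper level, optimized in alternation on a shared backbone parameter.

```python
# Toy 1-D quadratic stand-ins for the two BiSSL objectives (hypothetical;
# real lower/upper levels would be e.g. a SimCLR pretext loss and a
# downstream classification loss over backbone weights).
A = 1.0   # pretext-optimal backbone parameter (assumed)
B = 0.5   # downstream-optimal backbone parameter (assumed)

def pretext_loss(w):      # lower-level objective (pretext task)
    return (w - A) ** 2

def downstream_loss(w):   # upper-level objective (downstream task)
    return (w - B) ** 2

def alternating_bilevel(w, lr=0.1, rounds=50, lower_steps=5, upper_steps=1):
    """Alternate gradient steps between the lower-level (pretext) and
    upper-level (downstream) objectives on a shared backbone parameter."""
    for _ in range(rounds):
        for _ in range(lower_steps):
            w -= lr * 2.0 * (w - A)   # gradient of pretext_loss
        for _ in range(upper_steps):
            w -= lr * 2.0 * (w - B)   # gradient of downstream_loss
    return w

w0 = 0.0
w = alternating_bilevel(w0)
# The resulting parameter improves both objectives relative to w0,
# i.e. an initialization better aligned with the downstream task.
print(pretext_loss(w) < pretext_loss(w0), downstream_loss(w) < downstream_loss(w0))
```

The sketch converges to a point between the two optima, weighted toward the pretext solution by the larger number of lower-level steps; the real framework instead couples the two levels through the full bilevel formulation before standard fine-tuning proceeds.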