🤖 AI Summary
Existing contrastive learning approaches pretrain only the encoder, leaving the decoder to be trained separately for downstream tasks, thus overlooking the potential of end-to-end joint optimization. This paper proposes DeCon, the first framework enabling end-to-end self-supervised contrastive pretraining of encoder-decoder architectures. Specifically, it (1) extends single-encoder contrastive methods to trainable encoder-decoder structures; (2) introduces a weighted encoder-decoder cooperative contrastive loss that facilitates non-competitive joint optimization; and (3) preserves framework agnosticism and compatibility with heterogeneous decoders. DeCon achieves state-of-the-art performance on COCO for object detection and instance segmentation, as well as on Pascal VOC for semantic segmentation. Moreover, it significantly improves few-shot and cross-domain generalization, demonstrating the effectiveness of unified encoder-decoder pretraining.
📝 Abstract
Contrastive learning in self-supervised settings primarily focuses on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. This conventional approach, however, overlooks the potential benefits of jointly pre-training both the encoder and decoder. In this paper, we propose DeCon: a framework-agnostic adaptation that converts an encoder-only self-supervised learning (SSL) contrastive approach into an efficient encoder-decoder framework that can be pre-trained in a contrastive manner. We first update the existing architecture to accommodate a decoder and its respective contrastive loss. We then introduce a weighted encoder-decoder contrastive loss with non-competing objectives that facilitates joint encoder-decoder pre-training. We adapt two established contrastive SSL frameworks tailored for dense prediction tasks, achieve new state-of-the-art results on COCO object detection and instance segmentation, and match state-of-the-art performance on Pascal VOC semantic segmentation. We show that our approach allows for pre-training a decoder and enhances the representation power of the encoder and its performance on dense prediction tasks. This benefit holds across heterogeneous decoder architectures between pre-training and fine-tuning and persists in out-of-domain, limited-data scenarios.
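The abstract does not spell out the exact form of the weighted encoder-decoder contrastive loss, so the following is only a minimal NumPy sketch of its general shape under common assumptions: each level (encoder output, decoder output) contributes an InfoNCE-style contrastive term over two augmented views, and a weight `w` balances the two terms so neither objective dominates. The function names (`info_nce`, `decon_loss`) and the specific weighting scheme are illustrative, not the paper's actual implementation.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss between two batches of embeddings (views of the
    same images); positive pairs sit on the diagonal of the
    similarity matrix."""
    # L2-normalize embeddings row-wise so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (B, B) similarity matrix
    # cross-entropy with the matching view as the target class
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def decon_loss(enc_v1, enc_v2, dec_v1, dec_v2, w=0.5):
    """Weighted combination of an encoder-level and a decoder-level
    contrastive term (illustrative shape only; the paper's exact
    weighting is not given in the abstract)."""
    loss_enc = info_nce(enc_v1, enc_v2)
    loss_dec = info_nce(dec_v1, dec_v2)
    return (1.0 - w) * loss_enc + w * loss_dec

# Usage: random stand-ins for encoder/decoder embeddings of two views
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))
d1, d2 = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
loss = decon_loss(e1, e2, d1, d2, w=0.3)
```

Because both terms are standard contrastive objectives over the *same* pair of views, they pull the shared encoder in compatible directions, which is one plausible reading of the "non-competing objectives" the abstract emphasizes.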