🤖 AI Summary
Existing contrastive learning approaches pretrain only the encoder, leaving the decoder to be trained separately for downstream tasks, thus overlooking the potential of end-to-end joint optimization. This paper proposes DeCon, the first framework enabling end-to-end self-supervised contrastive pretraining of encoder-decoder architectures. Specifically, it (1) extends single-encoder contrastive methods to trainable encoder-decoder structures; (2) introduces a weighted encoder-decoder cooperative contrastive loss that facilitates non-competitive joint optimization; and (3) preserves framework agnosticism and compatibility with heterogeneous decoders. DeCon achieves state-of-the-art performance on COCO for object detection and instance segmentation, as well as on Pascal VOC for semantic segmentation. Moreover, it significantly improves few-shot and cross-domain generalization, demonstrating the effectiveness of unified encoder-decoder pretraining.
📝 Abstract
Contrastive learning in self-supervised settings primarily focuses on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. This conventional approach, however, overlooks the potential benefits of jointly pre-training both the encoder and decoder. In this paper, we propose DeCon: a framework-agnostic adaptation that converts an encoder-only self-supervised learning (SSL) contrastive approach into an efficient encoder-decoder framework that can be pre-trained in a contrastive manner. We first update the existing architecture to accommodate a decoder and its respective contrastive loss. We then introduce a weighted encoder-decoder contrastive loss with non-competing objectives that facilitates joint encoder-decoder pre-training. We adapt two established contrastive SSL frameworks tailored for dense prediction tasks, achieve new state-of-the-art results on COCO object detection and instance segmentation, and match state-of-the-art performance on Pascal VOC semantic segmentation. We show that our approach allows for pre-training a decoder and enhances the representation power of the encoder and its performance on dense prediction tasks. This benefit holds across heterogeneous decoder architectures between pre-training and fine-tuning and persists in out-of-domain, limited-data scenarios.
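The abstract does not spell out the exact form of the weighted encoder-decoder contrastive loss, so the following is only a minimal NumPy sketch of its general shape under common assumptions: each level (encoder output, decoder output) contributes an InfoNCE-style contrastive term over two augmented views, and a weight `w` balances the two terms so neither objective dominates. The function names (`info_nce`, `decon_loss`) and the specific weighting scheme are illustrative, not the paper's actual implementation.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss between two batches of embeddings (views of the
    same images); positive pairs sit on the diagonal of the
    similarity matrix."""
    # L2-normalize embeddings row-wise so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (B, B) similarity matrix
    # cross-entropy with the matching view as the target class
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def decon_loss(enc_v1, enc_v2, dec_v1, dec_v2, w=0.5):
    """Weighted combination of an encoder-level and a decoder-level
    contrastive term (illustrative shape only; the paper's exact
    weighting is not given in the abstract)."""
    loss_enc = info_nce(enc_v1, enc_v2)
    loss_dec = info_nce(dec_v1, dec_v2)
    return (1.0 - w) * loss_enc + w * loss_dec

# Usage: random stand-ins for encoder/decoder embeddings of two views
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))
d1, d2 = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
loss = decon_loss(e1, e2, d1, d2, w=0.3)
```

Because both terms are standard contrastive objectives over the *same* pair of views, they pull the shared encoder in compatible directions, which is one plausible reading of the "non-competing objectives" the abstract emphasizes.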