Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language contrastive methods such as CLIP learn strong high-level representations from natural language supervision but lack localized feature modeling, which degrades performance on dense prediction tasks such as segmentation and detection; purely self-supervised methods, conversely, learn granular features but lack semantic guidance. This paper proposes Harmony, a unified pre-training framework that jointly leverages weak supervision and self-supervision. It replaces hard one-hot contrastive targets with soft CLIP targets generated by an EMA teacher, mitigating the noisy one-to-one image-text correspondence of web-scraped data, and it optimizes a discriminative (iBOT-style masked modeling) and a generative (MAE-style reconstruction) self-supervised objective alongside the contrastive one. No human annotations or explicit negative examples are required. Pre-training a ViT-S/16 on CC3M, Harmony outperforms CLIP, MaskCLIP, SLIP, iBOT, and MAE on ImageNet-1K zero-shot classification and fine-tuning, ADE20K semantic segmentation, and MS-COCO object detection and instance segmentation.
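The soft-target idea in the summary can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: instead of the one-hot (hard) targets of standard CLIP, the target for each image is the EMA teacher's softened image-text similarity distribution over the batch. The function names, the temperature value, and the symmetric averaging are illustrative assumptions.

```python
import numpy as np

def log_softmax(z, axis=-1):
    # Numerically stable log-softmax: z - logsumexp(z).
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def softmax(z, axis=-1):
    return np.exp(log_softmax(z, axis=axis))

def soft_clip_loss(img, txt, t_img, t_txt, tau=0.07):
    """Contrastive loss with soft targets from an EMA teacher (sketch).

    img, txt      : student image / text embeddings, shape (batch, dim)
    t_img, t_txt  : teacher embeddings for the same batch
    Unlike hard CLIP targets, the teacher's softened similarity
    distribution tolerates partially misaligned web-scraped pairs.
    """
    l2 = lambda a: a / np.linalg.norm(a, axis=-1, keepdims=True)
    img, txt, t_img, t_txt = map(l2, (img, txt, t_img, t_txt))

    # Student logits over the batch, in both directions.
    logits_i2t = img @ txt.T / tau
    logits_t2i = txt @ img.T / tau

    # Teacher's soft target distributions (treated as constants; no gradient).
    targets_i2t = softmax(t_img @ t_txt.T / tau)
    targets_t2i = softmax(t_txt @ t_img.T / tau)

    # Cross-entropy against soft targets, averaged over both directions.
    loss_i2t = -(targets_i2t * log_softmax(logits_i2t)).sum(-1).mean()
    loss_t2i = -(targets_t2i * log_softmax(logits_t2i)).sum(-1).mean()
    return (loss_i2t + loss_t2i) / 2
```

When the teacher's similarity matrix is sharply peaked on the diagonal, this reduces to the usual symmetric CLIP cross-entropy; for noisy pairs the target mass spreads over plausible matches instead of forcing a hard negative.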

📝 Abstract
Vision-language contrastive learning frameworks like CLIP enable learning representations from natural language supervision, and provide strong zero-shot classification capabilities. However, due to the nature of the supervisory signal in these paradigms, they lack the ability to learn localized features, leading to degraded performance on dense prediction tasks like segmentation and detection. On the other hand, self-supervised learning methods have shown the ability to learn granular representations, complementing the high-level features in vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that can be generalized across vision downstream tasks. Our framework is specifically designed to work on web-scraped data by not relying on negative examples and addressing the one-to-one correspondence issue using soft CLIP targets generated by an EMA model. We comprehensively evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP and the previously leading joint self- and weakly-supervised methods, MaskCLIP and SLIP. Specifically, when comparing against these methods, Harmony shows superior performance in fine-tuning and zero-shot classification on ImageNet-1k, semantic segmentation on ADE20K, and both object detection and instance segmentation on MS-COCO, when pre-training a ViT-S/16 on CC3M. We also show that Harmony outperforms other self-supervised learning methods like iBOT and MAE across all tasks evaluated. Our code is publicly available at https://github.com/MohammedSB/Harmony.
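The abstract describes three objectives trained jointly. A plausible way to write the overall pre-training objective is a weighted sum; the weights $\lambda_i$ here are hypothetical placeholders, not values stated in the abstract:

```latex
\mathcal{L}_{\text{Harmony}}
  = \lambda_{1}\,\mathcal{L}_{\text{soft-CLIP}}
  + \lambda_{2}\,\mathcal{L}_{\text{iBOT}}
  + \lambda_{3}\,\mathcal{L}_{\text{MAE}}
```

where $\mathcal{L}_{\text{soft-CLIP}}$ is the contrastive loss with EMA-teacher soft targets, $\mathcal{L}_{\text{iBOT}}$ the discriminative masked-modeling loss, and $\mathcal{L}_{\text{MAE}}$ the generative pixel-reconstruction loss.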
Problem

Research questions and friction points this paper is trying to address.

CLIP-style supervision yields high-level features but not localized ones, degrading dense prediction tasks (segmentation, detection)
Web-scraped image-text pairs are noisy, so hard one-to-one contrastive targets and negative sampling are ill-suited
Purely self-supervised methods learn granular features but lack the semantic guidance needed for zero-shot classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly optimizes vision-language contrastive, discriminative (iBOT-style), and generative (MAE-style) objectives
Replaces hard contrastive targets with soft CLIP targets from an EMA teacher
Designed for noisy web-scraped data: no negative examples or human annotations required
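The EMA teacher mentioned above is a standard exponential-moving-average copy of the student. A minimal sketch, assuming the common formulation (the momentum value 0.996 is a typical choice, not the paper's schedule):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """One EMA step: teacher <- m * teacher + (1 - m) * student.

    The slowly moving teacher supplies stable soft targets for the
    contrastive and distillation objectives; it receives no gradients.
    `teacher` and `student` are matching lists of parameter arrays.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]
```

With momentum close to 1, the teacher changes slowly, which keeps its soft targets consistent across training steps.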