Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language contrastive methods such as CLIP learn strong high-level representations from natural language supervision but lack localized feature modeling, which degrades performance on dense prediction tasks such as segmentation and detection; purely self-supervised methods, conversely, learn granular features but lack semantic guidance. This paper proposes Harmony, a unified pre-training framework that jointly leverages weak supervision and self-supervision. It replaces hard one-hot contrastive targets with soft CLIP targets generated by an EMA teacher, mitigating the noisy one-to-one image-text correspondence of web-scraped data, and it optimizes a discriminative (iBOT-style masked modeling) and a generative (MAE-style reconstruction) self-supervised objective alongside the contrastive one. No human annotations or explicit negative examples are required. Pre-training a ViT-S/16 on CC3M, Harmony outperforms CLIP, MaskCLIP, SLIP, iBOT, and MAE on ImageNet-1K zero-shot classification and fine-tuning, ADE20K semantic segmentation, and MS-COCO object detection and instance segmentation.
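The soft-target idea in the summary can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: instead of the one-hot (hard) targets of standard CLIP, the target for each image is the EMA teacher's softened image-text similarity distribution over the batch. The function names, the temperature value, and the symmetric averaging are illustrative assumptions.

```python
import numpy as np

def log_softmax(z, axis=-1):
    # Numerically stable log-softmax: z - logsumexp(z).
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def softmax(z, axis=-1):
    return np.exp(log_softmax(z, axis=axis))

def soft_clip_loss(img, txt, t_img, t_txt, tau=0.07):
    """Contrastive loss with soft targets from an EMA teacher (sketch).

    img, txt      : student image / text embeddings, shape (batch, dim)
    t_img, t_txt  : teacher embeddings for the same batch
    Unlike hard CLIP targets, the teacher's softened similarity
    distribution tolerates partially misaligned web-scraped pairs.
    """
    l2 = lambda a: a / np.linalg.norm(a, axis=-1, keepdims=True)
    img, txt, t_img, t_txt = map(l2, (img, txt, t_img, t_txt))

    # Student logits over the batch, in both directions.
    logits_i2t = img @ txt.T / tau
    logits_t2i = txt @ img.T / tau

    # Teacher's soft target distributions (treated as constants; no gradient).
    targets_i2t = softmax(t_img @ t_txt.T / tau)
    targets_t2i = softmax(t_txt @ t_img.T / tau)

    # Cross-entropy against soft targets, averaged over both directions.
    loss_i2t = -(targets_i2t * log_softmax(logits_i2t)).sum(-1).mean()
    loss_t2i = -(targets_t2i * log_softmax(logits_t2i)).sum(-1).mean()
    return (loss_i2t + loss_t2i) / 2
```

When the teacher's similarity matrix is sharply peaked on the diagonal, this reduces to the usual symmetric CLIP cross-entropy; for noisy pairs the target mass spreads over plausible matches instead of forcing a hard negative.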

📝 Abstract
Vision-language contrastive learning frameworks like CLIP enable learning representations from natural language supervision, and provide strong zero-shot classification capabilities. However, due to the nature of the supervisory signal in these paradigms, they lack the ability to learn localized features, leading to degraded performance on dense prediction tasks like segmentation and detection. On the other hand, self-supervised learning methods have shown the ability to learn granular representations, complementing the high-level features in vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that can be generalized across vision downstream tasks. Our framework is specifically designed to work on web-scraped data by not relying on negative examples and addressing the one-to-one correspondence issue using soft CLIP targets generated by an EMA model. We comprehensively evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP and the previously leading joint self- and weakly-supervised methods, MaskCLIP and SLIP. Specifically, when comparing against these methods, Harmony shows superior performance in fine-tuning and zero-shot classification on ImageNet-1k, semantic segmentation on ADE20K, and both object detection and instance segmentation on MS-COCO, when pre-training a ViT-S/16 on CC3M. We also show that Harmony outperforms other self-supervised learning methods like iBOT and MAE across all tasks evaluated. Our code is publicly available at https://github.com/MohammedSB/Harmony.
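The abstract describes three objectives trained jointly. A plausible way to write the overall pre-training objective is a weighted sum; the weights $\lambda_i$ here are hypothetical placeholders, not values stated in the abstract:

```latex
\mathcal{L}_{\text{Harmony}}
  = \lambda_{1}\,\mathcal{L}_{\text{soft-CLIP}}
  + \lambda_{2}\,\mathcal{L}_{\text{iBOT}}
  + \lambda_{3}\,\mathcal{L}_{\text{MAE}}
```

where $\mathcal{L}_{\text{soft-CLIP}}$ is the contrastive loss with EMA-teacher soft targets, $\mathcal{L}_{\text{iBOT}}$ the discriminative masked-modeling loss, and $\mathcal{L}_{\text{MAE}}$ the generative pixel-reconstruction loss.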
Problem

Research questions and friction points this paper is trying to address.

CLIP-style supervision yields high-level features but not localized ones, degrading dense prediction tasks (segmentation, detection)
Web-scraped image-text pairs are noisy, so hard one-to-one contrastive targets and negative sampling are ill-suited
Purely self-supervised methods learn granular features but lack the semantic guidance needed for zero-shot classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly optimizes vision-language contrastive, discriminative (iBOT-style), and generative (MAE-style) objectives
Replaces hard contrastive targets with soft CLIP targets from an EMA teacher
Designed for noisy web-scraped data: no negative examples or human annotations required
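The EMA teacher mentioned above is a standard exponential-moving-average copy of the student. A minimal sketch, assuming the common formulation (the momentum value 0.996 is a typical choice, not the paper's schedule):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """One EMA step: teacher <- m * teacher + (1 - m) * student.

    The slowly moving teacher supplies stable soft targets for the
    contrastive and distillation objectives; it receives no gradients.
    `teacher` and `student` are matching lists of parameter arrays.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]
```

With momentum close to 1, the teacher changes slowly, which keeps its soft targets consistent across training steps.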