COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) rely heavily on a global contrastive loss, which overemphasizes foreground objects while neglecting local image structure and contextual information, limiting performance on downstream tasks. To address this, the paper proposes a cross-modality self-distillation framework that combines a novel text-cropping strategy with a cross-attention module, constructing multi-granular (global/local) image-text views for fine-grained cross-modal alignment. The method unifies contrastive learning with self-supervised knowledge distillation to jointly optimize multi-scale representations. Extensive experiments show consistent gains over strong baselines, including CLIP and FLAVA, on zero-shot image-text retrieval, classification, and semantic segmentation. Notably, the model achieves superior visual perception and contextual understanding compared to CLIP variants trained on significantly larger datasets.

📝 Abstract
Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.
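The text-cropping strategy in the abstract is the textual analogue of random image cropping: a caption is sampled into one near-complete "global" view and several shorter "local" sub-spans, giving the multi-modal augmentations that self-distillation needs. A minimal pure-Python sketch of this idea; the function names and crop fractions here are illustrative assumptions, not values from the paper:

```python
import random


def crop_text(tokens, min_frac, max_frac, rng):
    """Sample a contiguous sub-span of a tokenized caption.

    Analogous to a random image crop: a 'global' view keeps a large
    fraction of the tokens, a 'local' view a smaller one.
    """
    n = len(tokens)
    span = max(1, int(n * rng.uniform(min_frac, max_frac)))
    start = rng.randint(0, n - span)
    return tokens[start:start + span]


def multi_crop_views(tokens, n_local=2, seed=0):
    """Build one global and several local text views for self-distillation.

    Crop fractions (0.7-1.0 global, 0.3-0.6 local) are illustrative.
    """
    rng = random.Random(seed)
    global_view = crop_text(tokens, 0.7, 1.0, rng)
    local_views = [crop_text(tokens, 0.3, 0.6, rng) for _ in range(n_local)]
    return global_view, local_views
```

In training, the corresponding global/local image crops would be paired with these text views so that every student view can be matched against a teacher view from the other modality.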
Problem

Research questions and friction points this paper is trying to address.

VLMs focus on foreground objects, neglecting crucial image information
Need for comprehensive cross-modal representations in vision-language tasks
Improving zero-shot performance in retrieval, classification, and segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates text-cropping strategy for self-distillation
Uses cross-attention module for cross-modal learning
Creates global and local multi-modal augmentations
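Putting the pieces above together, the cross-modality self-distillation loss can be thought of as a cross-entropy between a sharpened teacher distribution (computed from a global view, with the teacher tracking the student via an exponential moving average) and a student distribution computed from a local or cross-modal view. The sketch below follows the common DINO-style formulation; the temperatures and momentum value are generic assumptions, not the paper's reported hyperparameters:

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax (lower temperature = sharper)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_distillation_loss(student_logits, teacher_logits,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the sharpened teacher distribution (global
    view, treated as a fixed target) and the student distribution
    (local or cross-modal view)."""
    p_t = softmax(teacher_logits, t_teacher)
    p_s = softmax(student_logits, t_student)
    return -sum(pt * math.log(ps + 1e-12) for pt, ps in zip(p_t, p_s))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights track the student via an exponential moving average."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

The loss is small when the student's prediction for a local view agrees with the teacher's prediction for the matching global view, which is what pushes the model to encode context beyond foreground objects.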