COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) rely heavily on a global contrastive loss, which overemphasizes foreground objects while neglecting local image structure and contextual information, limiting performance on downstream tasks. To address this, the paper proposes a cross-modality self-distillation framework that combines a novel text-cropping strategy with a cross-attention module, constructing multi-granular (global/local) image-text views for fine-grained cross-modal alignment. The method unifies contrastive learning with self-supervised knowledge distillation to jointly optimize multi-scale representations. Extensive experiments show consistent gains over strong baselines, including CLIP and FLAVA, on zero-shot image-text retrieval, classification, and semantic segmentation. Notably, the model achieves superior visual perception and contextual understanding compared to CLIP variants trained on significantly larger datasets.

📝 Abstract
Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.
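The text-cropping strategy in the abstract is the textual analogue of random image cropping: a caption is sampled into one near-complete "global" view and several shorter "local" sub-spans, giving the multi-modal augmentations that self-distillation needs. A minimal pure-Python sketch of this idea; the function names and crop fractions here are illustrative assumptions, not values from the paper:

```python
import random


def crop_text(tokens, min_frac, max_frac, rng):
    """Sample a contiguous sub-span of a tokenized caption.

    Analogous to a random image crop: a 'global' view keeps a large
    fraction of the tokens, a 'local' view a smaller one.
    """
    n = len(tokens)
    span = max(1, int(n * rng.uniform(min_frac, max_frac)))
    start = rng.randint(0, n - span)
    return tokens[start:start + span]


def multi_crop_views(tokens, n_local=2, seed=0):
    """Build one global and several local text views for self-distillation.

    Crop fractions (0.7-1.0 global, 0.3-0.6 local) are illustrative.
    """
    rng = random.Random(seed)
    global_view = crop_text(tokens, 0.7, 1.0, rng)
    local_views = [crop_text(tokens, 0.3, 0.6, rng) for _ in range(n_local)]
    return global_view, local_views
```

In training, the corresponding global/local image crops would be paired with these text views so that every student view can be matched against a teacher view from the other modality.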
Problem

Research questions and friction points this paper is trying to address.

VLMs focus on foreground objects, neglecting crucial image information
Need for comprehensive cross-modal representations in vision-language tasks
Improving zero-shot performance in retrieval, classification, and segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates text-cropping strategy for self-distillation
Uses cross-attention module for cross-modal learning
Creates global and local multi-modal augmentations
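Putting the pieces above together, the cross-modality self-distillation loss can be thought of as a cross-entropy between a sharpened teacher distribution (computed from a global view, with the teacher tracking the student via an exponential moving average) and a student distribution computed from a local or cross-modal view. The sketch below follows the common DINO-style formulation; the temperatures and momentum value are generic assumptions, not the paper's reported hyperparameters:

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax (lower temperature = sharper)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_distillation_loss(student_logits, teacher_logits,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the sharpened teacher distribution (global
    view, treated as a fixed target) and the student distribution
    (local or cross-modal view)."""
    p_t = softmax(teacher_logits, t_teacher)
    p_s = softmax(student_logits, t_student)
    return -sum(pt * math.log(ps + 1e-12) for pt, ps in zip(p_t, p_s))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights track the student via an exponential moving average."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

The loss is small when the student's prediction for a local view agrees with the teacher's prediction for the matching global view, which is what pushes the model to encode context beyond foreground objects.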