Cross-Modal Mapping: Eliminating the Modality Gap for Few-Shot Image Classification

📅 2024-12-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address weak class prototypes in few-shot cross-modal image classification, which stem from the modality gap between visual and textual features in vision-language models such as CLIP, this paper proposes a lightweight Cross-Modal Mapping (CMM) method. CMM aligns the visual and textual feature spaces with a linear transformation and optimizes an end-to-end cross-modal triplet loss, enabling frozen text embeddings to serve directly as robust class prototypes for images without fine-tuning the language model. CMM avoids parameter-intensive adaptation entirely: it achieves an average improvement of roughly 3.5% across 11 standard few-shot benchmarks and generalizes well under four distribution-shift scenarios. The core contribution is bridging the modality gap in pretrained multimodal representations with a minimal architectural design, yielding a plug-and-play, fine-tuning-free construction of textual class prototypes.
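
The frozen-prototype inference this summary describes can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption rather than the authors' released code: the class name `CrossModalMapping`, the 512-dimensional feature size, and the `classify` helper are hypothetical. The idea is that a single linear layer maps image features into the text feature space, and classification is nearest-prototype matching against frozen, L2-normalized class text embeddings.

```python
# Minimal sketch, assuming CLIP-style 512-d features; not the paper's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMapping(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # The only trainable component in this sketch: one linear map.
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # Project image features into the textual feature space and normalize,
        # so cosine similarity against text prototypes is well defined.
        return F.normalize(self.proj(image_feats), dim=-1)

@torch.no_grad()
def classify(image_feats, text_prototypes, mapper):
    # text_prototypes: (num_classes, dim) frozen, L2-normalized text embeddings.
    mapped = mapper(image_feats)           # (batch, dim) in the text space
    logits = mapped @ text_prototypes.t()  # cosine similarities to prototypes
    return logits.argmax(dim=-1)           # predicted class indices
```

Because only `proj` carries parameters, this stays lightweight: the vision and language encoders and the text prototypes are never updated.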

📝 Abstract
In few-shot image classification tasks, methods based on pretrained vision-language models (such as CLIP) have achieved significant progress. Many existing approaches directly use visual or textual features as class prototypes; however, these features fail to adequately represent their respective classes. We identify that this limitation arises from the modality gap inherent in pretrained vision-language models, which weakens the connection between the visual and textual modalities. To eliminate this modality gap and enable textual features to fully represent class prototypes, we propose a simple and efficient Cross-Modal Mapping (CMM) method. This method employs a linear transformation to map image features into the textual feature space, ensuring that both modalities are comparable within the same feature space. However, the modality gap diminishes the effectiveness of this mapping, so we further introduce a triplet loss to optimize the spatial relationships between image features and class textual features, allowing class textual features to naturally serve as class prototypes for image features. Experimental results on 11 benchmarks demonstrate an average improvement of approximately 3.5% over conventional methods, with competitive performance on 4 distribution-shift benchmarks.
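
As a rough sketch of the training objective the abstract describes, the hypothetical PyTorch loss below pulls each mapped image feature toward its own class text embedding and pushes it away from the most similar other class. The margin value and the hardest-negative selection are assumptions for illustration; the abstract only specifies that a triplet loss optimizes the spatial relationships between mapped image features and class textual features.

```python
# Hedged sketch of a cross-modal triplet loss; details are assumed, not quoted.
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(mapped_feats, text_protos, labels, margin=0.2):
    """mapped_feats: (B, D) normalized image features after the linear map.
    text_protos:  (C, D) frozen, normalized class text embeddings.
    labels:       (B,) int64 ground-truth class indices."""
    sims = mapped_feats @ text_protos.t()                 # (B, C) cosine sims
    pos = sims.gather(1, labels.unsqueeze(1)).squeeze(1)  # sim to own class
    # Mask out each sample's own class, then take the hardest negative
    # (the most similar wrong-class text embedding).
    mask = F.one_hot(labels, num_classes=sims.size(1)).bool()
    neg = sims.masked_fill(mask, float("-inf")).max(dim=1).values
    # Triplet hinge: the positive should beat the negative by at least `margin`.
    return F.relu(margin - pos + neg).mean()

# Typical training step (sketch): only the linear mapper's weights are updated.
# loss = cross_modal_triplet_loss(mapper(img_feats), text_protos, labels)
# loss.backward(); optimizer.step()
```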
Problem

Research questions and friction points this paper is trying to address.

Cross-modal Image Classification
Pre-trained Vision-Language Models
Limited-shot Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Mapping
Triplet Loss
Visual Language Model
Authors
Xi Yang (Guizhou University, Guiyang, China)
Pai Peng (Guizhou University, Guiyang, China)
Wulin Xie (Institute of Automation, Chinese Academy of Sciences; research areas: MLLM, Multi-Modal)
Xiaohuan Lu (Guizhou University, Guiyang, China)
Jie Wen (Associate Professor, North University of China (NUC); research areas: Quantum Control, Prognostic and Health Management)