Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a fundamental bottleneck in vision-language models (VLMs): the difficulty of simultaneously preserving semantic fidelity and ensuring downstream discriminability. To this end, we propose CoMa, a novel pretraining paradigm that decouples semantic preservation from discriminative feature learning by introducing compression learning as a warm-up stage preceding contrastive learning. CoMa achieves efficient semantic distillation and feature compression using only a small amount of data. On the MMEB benchmark, it attains state-of-the-art performance among models of comparable scale, significantly improving embedding quality for cross-modal retrieval, clustering, and classification. Its core innovation is the first explicit formulation of a compression objective as a semantic initialization mechanism for contrastive learning, jointly optimizing training efficiency and representation capability. Crucially, CoMa delivers high-performance multimodal embeddings even under stringent low-data-budget constraints.

📝 Abstract
Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings, substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that VLMs can be adapted into competitive embedding models via large-scale contrastive learning, optimizing these two complementary objectives simultaneously. We argue that the two objectives can be decoupled: a comprehensive understanding of the input helps the embedding model achieve superior downstream performance through contrastive learning. In this paper, we propose CoMa, a compression pre-training phase that serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, a VLM can be transformed into a competitive embedding model. CoMa achieves new state-of-the-art results among VLMs of comparable size on MMEB, improving both training efficiency and embedding effectiveness.
Problem

Research questions and friction points this paper is trying to address.

Decoupling comprehensive understanding from discriminative feature optimization
Developing efficient pre-training for multimodal embedding models
Enhancing vision-language model performance with limited data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compressed pre-training phase for contrastive learning
Decouples comprehensive input understanding from discriminative features
Transforms VLMs into competitive embedding models efficiently
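The "matching" stage referred to above is standard contrastive learning over paired embeddings. As an illustration only (the paper's exact loss, batch size, and hyperparameters are not given in this summary), here is a minimal NumPy sketch of the symmetric InfoNCE objective typically used for such a stage; the function name `info_nce` and the temperature value are assumptions, not the authors' code:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Matched (image, text) pairs sit on the diagonal of the
    similarity matrix; all other entries act as in-batch negatives.
    Illustrative sketch only; not the paper's implementation.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # shape (B, B)

    def xent_diag(m):
        # Cross-entropy with the diagonal entry as the target class.
        m = m - m.max(axis=1, keepdims=True)    # numerical stability
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With matched pairs on the diagonal, this loss falls toward zero as paired embeddings align and approaches log(batch size) for unrelated pairs. In CoMa's paradigm, the compression warm-up would precede an objective of this kind rather than replace it.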