Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal embedding models rely heavily on contrastive learning, and the causal attention and next-token-prediction paradigm of their underlying MLLMs struggles to produce globally compact semantic embeddings. To address this limitation, this work proposes the CoCoA pretraining paradigm, which integrates collaborative attention with an <EOS>-driven content reconstruction task. This approach explicitly guides the model to compress the semantics of multimodal inputs into the <EOS> token, yielding high-quality, information-dense embedding representations. Fine-tuning experiments based on Qwen2-VL and Qwen2.5-VL demonstrate that CoCoA significantly improves embedding performance on the MMEB-V1 benchmark, effectively overcoming the constraints that conventional generative paradigms impose on embedding quality.

📝 Abstract
Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. This drives the multimodal model to compress the semantic information of the input into the <EOS> token, laying the foundation for subsequent contrastive learning. Extensive experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality. The results validate that content reconstruction is an effective strategy for maximizing the value of existing data, enabling multimodal embedding models to generate compact and informative representations and raising their performance ceiling.
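The core idea of the EOS-based reconstruction objective can be sketched in a few lines: take the <EOS> hidden state as the sequence embedding, then predict every input token from that single vector, so the model is pressured to compress the whole input into it. All names, dimensions, and the linear reconstruction head below are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): per-token hidden states from an MLLM.
seq_len, d_model, vocab = 6, 8, 20
hidden = rng.standard_normal((seq_len, d_model))   # token-level hidden states
input_ids = rng.integers(0, vocab, size=seq_len)   # tokens to reconstruct

# Under causal attention the <EOS> token sits at the last position, and its
# hidden state is taken as the sequence embedding.
eos_embedding = hidden[-1]

# Hypothetical reconstruction head: each position is predicted from the
# <EOS> embedding plus a positional embedding, never from the other tokens.
pos = rng.standard_normal((seq_len, d_model)) * 0.1
W_rec = rng.standard_normal((d_model, vocab)) * 0.1
logits = (eos_embedding + pos) @ W_rec             # (seq_len, vocab)

# Cross-entropy reconstruction loss over the input tokens (positive scalar);
# minimizing it forces the input's semantics into the <EOS> embedding.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(seq_len), input_ids].mean()
```

In the paradigm the abstract describes, this reconstruction loss is a pre-training stage that precedes the usual contrastive fine-tuning, rather than replacing it.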
Problem

Research questions and friction points this paper is trying to address.

multimodal embedding
multimodal large language models
representation compactness
causal attention
embedding quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative Attention
Content Reconstruction
Multimodal Embedding
EOS-based Representation
Pre-training Paradigm
Jiahan Chen
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Da Li
Beijing Institute of Technology
Radar Systems · Cross-modal Learning · Sensor Fusion
Hengran Zhang
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Yinqiong Cai
Institute of Computing Technology, Chinese Academy of Sciences
Information Retrieval · NLP · Deep Learning
Lixin Su
Baidu Inc.
Information Retrieval · Question Answering
Jiafeng Guo
Professor, Institute of Computing Technology, CAS
Information Retrieval · Machine Learning · Text Analysis · NeuIR
Daiting Shi
Baidu Inc., Beijing, China
Dawei Yin
Senior Director, Head of Search Science at Baidu
Machine Learning · Web Mining · Data Mining
Keping Bi
Institute of Computing Technology, Chinese Academy of Sciences
Information Retrieval