CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance limitations of multimodal large language models (MLLMs) in retrieval tasks, which often stem from a mismatch between output formats and optimization objectives, as well as the common trade-off where enhancing retrieval capabilities compromises generative performance. To overcome these challenges, the authors propose CREM, a unified framework that jointly optimizes generative and contrastive objectives through a compression-driven representation enhancement mechanism. Key innovations include learnable chorus tokens, a compression-based prompt design, and a compression-aware attention mechanism. CREM achieves state-of-the-art performance on the MMEB retrieval benchmark while maintaining strong capabilities across diverse multimodal understanding and generation tasks.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable success in comprehension tasks such as visual description and visual question answering. However, their direct application to embedding-based tasks like retrieval remains challenging due to the discrepancy between output formats and optimization objectives. Previous approaches often employ contrastive fine-tuning to adapt MLLMs for retrieval, but at the cost of losing their generative capabilities. We argue that both generative and embedding tasks fundamentally rely on shared cognitive mechanisms, specifically cross-modal representation alignment and contextual comprehension. To this end, we propose CREM (Compression-driven Representation Enhanced Model), a unified framework that enhances multimodal representations for retrieval while preserving generative ability. Specifically, we introduce a compression-based prompt design with learnable chorus tokens to aggregate multimodal semantics, and a compression-driven training strategy that integrates contrastive and generative objectives through compression-aware attention. Extensive experiments demonstrate that CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks. Our findings highlight that generative supervision can further improve the representational quality of MLLMs under the proposed compression-driven paradigm.
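The abstract's core idea, pooling learnable compression ("chorus") tokens into a retrieval embedding and training it with a contrastive objective alongside a generative one, can be sketched in a few lines. This is a hypothetical minimal illustration, not the authors' implementation: the token count, pooling rule, loss weighting, and the placeholder generative loss are all assumptions, and the paper's compression-aware attention is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # toy hidden dimension
n_chorus = 2   # number of learnable chorus tokens (assumed)

# Learnable chorus tokens appended to the input sequence; only their
# hidden states are pooled into the retrieval embedding.
chorus = rng.normal(size=(n_chorus, d))

def embed(token_states):
    """Append chorus tokens, pool their positions, L2-normalize."""
    seq = np.concatenate([token_states, chorus], axis=0)
    pooled = seq[-n_chorus:].mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings."""
    logits = img_emb @ txt_emb.T / temperature
    labels = np.arange(len(logits))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)           # stable softmax
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()       # diagonal = positives
    return 0.5 * (ce(logits) + ce(logits.T))

# Toy batch: 4 image/text pairs, 5 token states each
imgs = np.stack([embed(rng.normal(size=(5, d))) for _ in range(4)])
txts = np.stack([embed(rng.normal(size=(5, d))) for _ in range(4)])

l_con = info_nce(imgs, txts)
l_gen = 1.0                   # stand-in for the generative (next-token) loss
total = l_con + 0.5 * l_gen   # weighted joint objective (weight assumed)
```

In the paper's framing, the contrastive term shapes the chorus-token embedding for retrieval while the generative term keeps the backbone's comprehension ability; here both are reduced to placeholders on random data.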
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models, retrieval, generative capability, representation alignment, embedding-based tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

compression-driven representation, multimodal retrieval, generative-contrastive alignment, learnable chorus tokens, multimodal large language models
🔎 Similar Papers

Lihao Liu (Amazon): LLM-based Agent, Healthcare AI
Yan Wang (Kuaishou Technology)
Biao Yang (Shanghai Jiao Tong University, Antai College of Economics and Management): Asset Pricing, Climate Finance
Da Li (Kuaishou Technology)
Jiangxia Cao (Kuaishou Technology): RecSys, Low-Resource Large Model
Yuxiao Luo (Kuaishou Technology)
Xiang Chen (Kuaishou Technology)
Xiangyu Wu (Kuaishou Technology)
Wei Yuan (Kuaishou Technology)
Fan Yang (Kuaishou Technology)
Guiguang Ding (Tsinghua University): Computer Vision, Multimedia Retrieval
Tingting Gao (Kuaishou Technology)
Guorui Zhou: Recommender System, Advertising, Artificial Intelligence, Machine Learning, NLP