TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing general-purpose multimodal retrieval models in effectively handling diverse user intents—ranging from simple keywords to complex compositional instructions—particularly when queries require logical reasoning. To this end, the authors propose an end-to-end framework that integrates generative reasoning with discriminative representation learning. The approach leverages a multimodal large language model to generate structured chains of thought (CoT) that explicitly parse query intent, which are then compressed into compact embeddings. A difficulty-aware routing mechanism dynamically decides whether to activate or bypass the reasoning module, thereby balancing accuracy and efficiency. Evaluated on the M-BEIR benchmark, the method achieves a new state of the art, significantly improving performance on complex query understanding, inference efficiency, and cross-domain zero-shot generalization.

📝 Abstract
Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.
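The pipeline described above — generate a chain of thought for hard queries, skip it for easy ones, then compress whatever text remains into a compact embedding — can be sketched as follows. This is a toy illustration only: the function names, the difficulty heuristic, and the hash-based "compression" are placeholders standing in for the paper's MLLM components, not TRACE's actual implementation.

```python
# Hypothetical sketch of TRACE-style task-adaptive embedding.
# All names and heuristics here are illustrative assumptions.

def estimate_difficulty(query: str) -> float:
    """Toy proxy: compositional cue words and length suggest a harder query."""
    cues = ("and", "but not", "except", "instead of", "which")
    return sum(cue in query.lower() for cue in cues) + len(query.split()) / 20


def generate_cot(query: str) -> str:
    """Stand-in for the MLLM generating a structured chain of thought."""
    return f"Intent: parse '{query}'. Target: items satisfying every stated constraint."


def compress_to_embedding(text: str, dim: int = 8) -> list[float]:
    """Stand-in for compressing a reasoning trace into a compact vector
    via a dedicated token (here: a trivial character-sum projection)."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def trace_embed(query: str, threshold: float = 1.0) -> tuple[list[float], bool]:
    """Difficulty-aware routing: reason only when the query looks complex."""
    use_reasoning = estimate_difficulty(query) > threshold
    text = generate_cot(query) if use_reasoning else query
    return compress_to_embedding(text), use_reasoning
```

Under this sketch, a one-word query like `"cat"` bypasses reasoning, while a compositional query such as `"a red chair like this one but not leather and facing left"` triggers the CoT branch — mirroring the accuracy/throughput trade-off the paper attributes to its learned routing behavior.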
Problem

Research questions and friction points this paper is trying to address.

Universal Multimodal Retrieval
Multimodal Large Language Models
Complex Query Understanding
Reasoning
Embedding Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Reasoning
Multimodal Retrieval
Task-Adaptive Embedding
Generative-Discriminative Fusion
Zero-Shot Transfer
Xiangzhao Hao
Institute of Automation, Chinese Academy of Sciences
Multimodal Large Language Models, Reinforcement Learning, Multimodal Retrieval
Shijie Wang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Tianyu Yang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Tianyue Wang
Zhejiang University
AI4Science, Loop Prediction, Protein Design
Haiyun Guo
Rice University ECE Ph.D.
optical imaging, computational photography, Metalens
Jinqiao Wang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences