Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-modal retrieval benchmarks focus primarily on coarse-grained or single-condition alignment and fall short of real-world user queries that combine multiple constraints with fine-grained specifications expressed in natural language. To bridge this gap, this work proposes MCMR, the first benchmark for multi-condition, fine-grained, and composable cross-modal retrieval, spanning five product domains and emphasizing constraint awareness and interpretability. The paper benchmarks MLLM-based multimodal retrievers alongside pointwise vision-language re-rankers that integrate visual features with long-form textual metadata for joint query-candidate verification. Experiments show that visual cues dominate top-rank accuracy, textual metadata improves ranking stability for long-tail items, and MLLM-based re-ranking substantially improves fine-grained matching, filling a critical evaluation gap for complex query scenarios.
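
The pointwise re-ranking setup described above lends itself to a compact sketch. The following is a minimal, hypothetical illustration, not the paper's released code: `Candidate`, `pointwise_rerank`, and `toy_verify` are assumed names, and the toy verifier only does keyword matching where a real system would prompt an MLLM with the candidate image and its metadata.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    item_id: str
    image_path: str         # product image shown to the MLLM verifier
    metadata: str           # long-form textual metadata (title, attributes, ...)
    retrieval_score: float  # first-stage similarity from the multimodal retriever

def pointwise_rerank(
    query: str,
    candidates: List[Candidate],
    verify: Callable[[str, Candidate], float],
    shortlist_size: int = 20,
) -> List[Candidate]:
    """Take the top first-stage candidates, score each one independently with
    a verifier, and return them ordered by the verifier's score."""
    shortlist = sorted(candidates, key=lambda c: c.retrieval_score, reverse=True)
    shortlist = shortlist[:shortlist_size]
    scored = [(verify(query, c), c) for c in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep order
    return [c for _, c in scored]

def toy_verify(query: str, cand: Candidate) -> float:
    """Stand-in for an MLLM call: fraction of comma-separated query conditions
    that literally appear in the candidate's metadata."""
    conditions = [c.strip().lower() for c in query.split(",") if c.strip()]
    if not conditions:
        return 0.0
    return sum(cond in cand.metadata.lower() for cond in conditions) / len(conditions)
```

In a real pipeline, `verify` would wrap an MLLM prompt that presents the image and metadata together and asks whether every query condition is satisfied, which is what makes the re-ranking stage constraint-aware rather than similarity-based.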

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval): a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries. MCMR spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture. It also preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance. We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal that: (i) models exhibit distinct modality asymmetries; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query-candidate consistency. Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding. Our code and dataset are available at https://github.com/EIT-NLP/MCMR.
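
To make the "jointly satisfy all specified conditions" relevance criterion concrete, here is a minimal sketch of all-condition matching and of Recall@K computed under it. The attribute-dictionary representation and the function names are assumptions for illustration, not the released MCMR evaluation code.

```python
from typing import Dict, Sequence

def satisfies_all(item_attrs: Dict[str, str], conditions: Dict[str, str]) -> bool:
    """An item is relevant only if every queried attribute matches (logical AND);
    partially matching items count as non-relevant."""
    return all(item_attrs.get(k, "").lower() == v.lower()
               for k, v in conditions.items())

def recall_at_k(ranking: Sequence[str],
                catalog: Dict[str, Dict[str, str]],
                conditions: Dict[str, str],
                k: int = 10) -> float:
    """Share of fully matching catalog items retrieved in the top k."""
    relevant = {item_id for item_id, attrs in catalog.items()
                if satisfies_all(attrs, conditions)}
    if not relevant:
        return 0.0
    return sum(item_id in relevant for item_id in ranking[:k]) / len(relevant)

# Example: a two-condition query; the color condition rules out item "b".
catalog = {
    "a": {"category": "shoes", "color": "white", "material": "leather"},
    "b": {"category": "shoes", "color": "black", "material": "leather"},
}
print(recall_at_k(["a", "b"], catalog,
                  {"color": "white", "material": "leather"}, k=1))
# -> 1.0 (item "a" is the only fully matching item and is ranked first)
```

Under this all-or-nothing criterion, an item matching four of five conditions scores the same as one matching none, which is exactly what makes multi-condition retrieval harder to evaluate than global-similarity benchmarks.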
Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval
multi-condition
fine-grained
cross-modal alignment
compositional matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Conditional Multimodal Retrieval
Fine-Grained Alignment
Multimodal Large Language Models
Compositional Matching
Constraint-Aware Retrieval
👥 Authors

Xuan Lu
Assistant Professor, University of Arizona
Human-centered Data Science · Human-AI Collaboration · Causal Inference · Future of Work · Emoji

Kangle Li
Shanghai Jiao Tong University

Haohang Huang
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative

Rui Meng
Salesforce Research
Machine Learning · Natural Language Processing

Wenjun Zeng
Institute of Digital Twin, Eastern Institute of Technology, Ningbo

Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model · multi-modal learning · reasoning