MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

📅 2026-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of matching complex multimodal queries against text corpora in reasoning-intensive retrieval. The authors propose MARVEL, a unified expand-retrieve-rerank pipeline that combines three stages: LLM-driven query expansion, a dense retriever fine-tuned for complex multimodal queries (MARVEL-Retriever), and GPT-4o-based chain-of-thought reranking with optional multi-pass reciprocal rank fusion. On the MM-BRIGHT benchmark, the pipeline achieves 37.9 nDCG@10, which is 10.3 points above the best multimodal encoder (27.6), and it outperforms all single-stage baselines in 27 of 29 technical domains.
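The headline numbers above are nDCG@10 scores. As a rough illustration of how such a score is computed for a single query, here is a minimal Python sketch. It assumes the common linear-gain form of DCG with a log2 position discount; the exact variant used by MM-BRIGHT is not stated here, and the function names and toy document IDs are hypothetical. Benchmark results such as 37.9 are the per-query value averaged over queries and scaled by 100.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: document IDs in the order the system returned them.
    relevance:  dict mapping document ID -> graded relevance (missing means 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    # The ideal ranking places the most relevant documents first.
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example (hypothetical IDs): the only relevant document is ranked second.
score = ndcg_at_k(["x", "d1", "y"], {"d1": 1})
print(round(score, 4))
```

Placing the single relevant document at rank 2 instead of rank 1 discounts its gain by 1/log2(3), so the score drops below 1.0 even though the document was retrieved.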

📝 Abstract

Multimodal retrieval over text corpora remains a fundamental challenge: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, a reasoning-intensive multimodal retrieval benchmark, underperforming strong text-only systems. We argue that effective multimodal retrieval requires three tightly integrated capabilities that existing approaches address only in isolation: expanding the query's latent intent, retrieving with a model trained for complex reasoning, and reranking via explicit step-by-step reasoning over candidates. We introduce MARVEL (Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL), a unified pipeline that combines LLM-driven query expansion, MARVEL-Retriever (a reasoning-enhanced dense retriever fine-tuned for complex multimodal queries), and GPT-4o-based chain-of-thought reranking with optional multi-pass reciprocal rank fusion. Evaluated on MM-BRIGHT across 29 technical domains, MARVEL achieves 37.9 nDCG@10, surpassing the best multimodal encoder by +10.3 points. It outperforms all single-stage baselines in 27 of 29 domains and matches or approaches the best baseline in the remaining two highly specialized domains (Crypto, Quantum Computing). These results demonstrate that reasoning-intensive multimodal retrieval is best addressed through a unified expand-retrieve-rerank framework. https://github.com/mm-bright/multimodal-reasoning-retrieval
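The abstract mentions optional multi-pass reciprocal rank fusion (RRF) in the reranking stage. As an illustration only, the following Python sketch shows the standard RRF formula, in which each document scores the sum of 1 / (k + rank) over the passes that ranked it. The constant k = 60 is the commonly used default and an assumption here, not a value taken from the paper. The function name and document IDs are hypothetical, and how MARVEL produces the individual passes is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: list of ranked lists, best document first in each list.
    k:        smoothing constant (assumed default of 60).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents missing from a pass simply receive no contribution from it.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three hypothetical reranking passes over the same candidate pool.
passes = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(passes)
print(fused)
```

Because RRF uses only rank positions, it can combine passes whose raw scores are not comparable, which is one plausible reason to prefer it when fusing several LLM reranking runs.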

Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval

reasoning-intensive

query expansion

dense retrieval

reranking

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal retrieval

reasoning-intensive

query expansion