🤖 AI Summary
This work addresses the performance limitations of multimodal retrieval systems when matching image–text hybrid queries against purely textual corpora, a failure the authors attribute to the entanglement of visual descriptions, conversational noise, and retrieval intent within queries. Identifying query alignment as the critical bottleneck, they propose BRIDGE, a plug-and-play framework that operates without multimodal encoders. BRIDGE pairs FORGE, a query refinement model trained with reinforcement learning to distill noisy multimodal queries into compact, retrieval-oriented text, with LENS, a reasoning-enhanced dense retriever designed to handle intent-rich queries. On the MM-BRIGHT benchmark, BRIDGE achieves an nDCG@10 of 29.7, outperforming all existing multimodal baselines; when FORGE is applied on top of Nomic-Vision, performance further improves to 33.3, surpassing the best purely text-based retriever (32.2).
📝 Abstract
Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present **BRIDGE**, a two-component system that resolves this mismatch without multimodal encoders. **FORGE** (**F**ocused Retrieval Query Generato**r**) is a query alignment model trained via reinforcement learning that distills noisy multimodal queries into compact, retrieval-optimized search strings. **LENS** (**L**anguage-**E**nhanced **N**eural **S**earch) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves **29.7** nDCG@10, surpassing all multimodal encoder baselines, including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches **33.3** nDCG@10, exceeding the best text-only retriever (32.2) and demonstrating that *query alignment* is the key bottleneck in multimodal-to-text retrieval.

Code: https://github.com/mm-bright/multimodal-reasoning-retrieval
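To make the two-stage design concrete, here is a minimal, self-contained sketch of the refine-then-retrieve pipeline the abstract describes. Everything in it is an illustrative assumption rather than the paper's actual models: the real FORGE is an RL-trained generator and the real LENS is a fine-tuned dense encoder, whereas this toy stands in with a rule-based query cleaner and a bag-of-words cosine-similarity retriever. The function names (`forge_refine`, `retrieve`) and the stopword list are hypothetical.

```python
# Toy sketch of a FORGE/LENS-style two-stage pipeline (assumed structure,
# not the paper's implementation).
from collections import Counter
import math

# Hypothetical set of conversational-noise tokens a refiner might strip.
STOPWORDS = {"please", "hi", "can", "you", "help", "me", "the", "a", "an",
             "in", "this", "image", "shows", "find"}

def forge_refine(raw_query: str) -> str:
    """Stand-in for FORGE: drop conversational noise, keep intent-bearing terms."""
    tokens = [t.lower().strip("?,.!") for t in raw_query.split()]
    return " ".join(t for t in tokens if t and t not in STOPWORDS)

def embed(text: str) -> Counter:
    """Stand-in for LENS's dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(raw_query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Refine the noisy query first, then rank documents by similarity."""
    q = embed(forge_refine(raw_query))
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "gradient descent convergence proof for convex functions",
    "recipe for sourdough bread",
    "convergence rate of stochastic gradient descent",
]
query = ("Hi, can you help me? This image shows a convergence proof "
         "for gradient descent.")
print(forge_refine(query))      # the compact, retrieval-oriented rewrite
print(retrieve(query, docs, k=1))
```

The point of the sketch is the division of labor: without `forge_refine`, the chatty prefix dilutes the query vector; after refinement, only intent-bearing terms reach the retriever, which is the alignment effect the paper argues matters more than the retriever itself.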