BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

📅 2026-04-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the performance limitations of multimodal retrieval systems when matching image–text hybrid queries against purely textual corpora, a challenge caused primarily by the entanglement of visual descriptions, conversational noise, and retrieval intent within the query. The authors identify query alignment as the critical bottleneck and propose BRIDGE, a plug-and-play framework that operates without multimodal encoders. BRIDGE pairs FORGE, a query alignment model trained via reinforcement learning to distill noisy multimodal queries into compact, retrieval-oriented search strings, with LENS, a reasoning-enhanced dense retriever designed to handle the intent-rich queries FORGE produces. Evaluated on the MM-BRIGHT benchmark, BRIDGE achieves an nDCG@10 of 29.7, outperforming all multimodal encoder baselines; when FORGE is applied on top of Nomic-Vision as a plug-and-play aligner, performance further improves to 33.3, surpassing the best purely text-based retriever (32.2).
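The listing does not include a reference implementation, but the two-stage flow described above (query alignment, then dense retrieval over a text corpus) can be sketched roughly as follows. Everything here is an illustrative assumption: `align_query` is a trivial stand-in for the RL-trained FORGE model, and a generic sentence-transformers encoder stands in for LENS; neither is the authors' code.

```python
# Hypothetical sketch of the BRIDGE flow: a query-alignment step (stand-in for FORGE)
# followed by dense retrieval (stand-in for LENS). Names and models are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder dense retriever

def align_query(image_caption: str, conversation: str) -> str:
    """Stand-in for FORGE: distill a noisy multimodal query into a short,
    retrieval-oriented search string. Here we simply concatenate and truncate;
    the paper instead trains a generation model with reinforcement learning."""
    return f"{image_caption} {conversation}".strip()[:256]

def retrieve(query: str, corpus: list[str], k: int = 10) -> list[tuple[int, float]]:
    """Embed the aligned query and the corpus, then rank by cosine similarity."""
    q_emb = retriever.encode([query], normalize_embeddings=True)
    d_emb = retriever.encode(corpus, normalize_embeddings=True)
    scores = (q_emb @ d_emb.T)[0]          # cosine similarity of unit vectors
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

corpus = ["Gradient descent converges for convex losses.",
          "Photosynthesis takes place in chloroplasts."]
query = align_query("diagram of a loss surface", "why does my model diverge?")
print(retrieve(query, corpus))
```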
📝 Abstract
Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10, exceeding the best text-only retriever (32.2) and demonstrating that query alignment is the key bottleneck in multimodal-to-text retrieval. https://github.com/mm-bright/multimodal-reasoning-retrieval
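For readers unfamiliar with the metric quoted throughout the abstract, nDCG@10 can be computed as below. This is the standard definition (with linear gain), shown only to make the reported numbers concrete; it is not code from the paper.

```python
# Standard nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k relevance labels, in rank order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """Normalize the system DCG by the best achievable DCG for these labels."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevance labels of the documents a system returned, in rank order.
print(round(ndcg_at_k([1, 0, 2, 0, 1], k=10), 3))
```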
Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval
image-text query
text-only corpus
query alignment
retrieval intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

query alignment
reinforcement learning
multimodal retrieval
dense retrieval
text generation
🔎 Similar Papers
No similar papers found.
Mohamed Darwish Mounis
High institute for computer & information systems
Mohamed Mahmoud
Chungbuk National University
Shaimaa Sedek
Assiut University
Mahmoud Abdalla
Master Student
computer vision, natural language processing
Mahmoud SalahEldin Kasem
Teaching and Research Assistant, Assiut University
Artificial Intelligence, Machine Learning, Deep Learning
Abdelrahman Abdallah
Innsbruck University
Question Answering, Large Language Models, Information Retrieval, Computer Vision, NLP
Hyun-Soo Kang
High institute for computer & information systems