MM-BRIGHT: A Multi-Task Multimodal Benchmark for Reasoning-Intensive Retrieval

📅 2026-01-14
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing retrieval benchmarks are largely confined to text-only queries and fail to capture real-world scenarios that involve multimodal inputs, such as images, and demand sophisticated reasoning. To address this gap, this work proposes MM-BRIGHT, the first multi-task multimodal benchmark tailored for reasoning-intensive retrieval. It comprises 2,803 authentic user queries spanning 29 technical domains and introduces four retrieval tasks of increasing complexity. Notably, MM-BRIGHT enables the first unified evaluation of models on both multimodal fusion and multi-task performance, and it reveals substantial limitations in current approaches: BM25 scores only 8.5 nDCG@10 on text-only retrieval, and the best multimodal model, Nomic-Vision (27.6 on multimodal-to-text), underperforms the best text-only model, DiVeR (32.2), underscoring the urgency and difficulty of advancing multimodal reasoning in retrieval systems.
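
For reference, all headline numbers above are nDCG@10. The snippet below is a minimal sketch of how that metric is computed for a single query, assuming graded relevance labels and the common exponential-gain formulation; the paper's exact nDCG variant and evaluation tooling are not specified here, so treat the function and its toy inputs as illustrative only.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for a single query.

    ranked_ids: document ids in the order the system returned them.
    relevance:  dict mapping document id -> graded relevance (0 = irrelevant).
    Uses the exponential-gain form: (2^rel - 1) / log2(rank + 1).
    """
    def dcg(gains):
        # rank is i + 1, so the log discount is log2(i + 2)
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: two relevant docs, one of which was retrieved at rank 3.
print(ndcg_at_k(["d3", "d7", "d1"], {"d1": 1, "d9": 1}, k=10))  # ~0.31
```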

📝 Abstract
Existing retrieval benchmarks primarily consist of text-based queries where keyword or semantic matching is usually sufficient. Many real-world queries, however, contain multimodal elements, particularly images such as diagrams, charts, and screenshots, that require intensive reasoning to identify relevant documents. To address this gap, we introduce MM-BRIGHT, the first multimodal benchmark for reasoning-intensive retrieval. Our dataset consists of 2,803 real-world queries spanning 29 diverse technical domains, with four tasks of increasing complexity: text-to-text, multimodal-to-text, multimodal-to-image, and multimodal-to-multimodal retrieval. Extensive evaluation reveals that state-of-the-art models struggle across all tasks: BM25 achieves only 8.5 nDCG@10 on text-only retrieval, while the best multimodal model, Nomic-Vision, reaches just 27.6 nDCG@10 on multimodal-to-text retrieval, actually underperforming the best text-only model (DiVeR: 32.2). These results highlight substantial headroom and position MM-BRIGHT as a testbed for next-generation retrieval models that better integrate visual reasoning. Our code and data are available at https://github.com/mm-bright/MM-BRIGHT. See also our official website: https://mm-bright.github.io/.
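
To make the text-to-text baseline concrete, below is a minimal sketch of a BM25 run using the rank_bm25 package. The corpus, query, and whitespace tokenization are hypothetical stand-ins: the actual MM-BRIGHT data format and the paper's BM25 configuration are not described here.

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Hypothetical stand-in corpus; real MM-BRIGHT documents are technical passages.
corpus = [
    "Segmentation fault when dereferencing a null pointer in C",
    "How to normalize a confusion matrix plotted with matplotlib",
    "Database index selection for range queries on timestamps",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "why does my C program crash with a null pointer"
scores = bm25.get_scores(query.lower().split())

# Rank documents by BM25 score, highest first (the top k feed nDCG@10).
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
for i in ranking:
    print(f"{scores[i]:6.3f}  {corpus[i]}")
```

A ranked list like this is what nDCG@10 is computed over; the paper's point is that such lexical matching breaks down on reasoning-intensive queries, hence BM25's 8.5 nDCG@10.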
Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval
reasoning-intensive retrieval
retrieval benchmark
visual reasoning
real-world queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal retrieval
reasoning-intensive
benchmark
visual reasoning
multi-task