🤖 AI Summary
This work addresses a critical gap in current evaluation methodologies, which often overlook the realistic requirement of retrieving evidence from large-scale, heterogeneous multimodal corpora before reasoning, and therefore overestimate models' end-to-end capabilities. To bridge this gap, the authors introduce MultiHaystack, the first large-scale benchmark to jointly evaluate cross-modal retrieval and reasoning, comprising over 46,000 multimodal candidates and 747 verifiable questions that demand fine-grained evidence localization across documents, images, and videos followed by precise inference. The study pairs state-of-the-art multimodal retrievers (e.g., E5-V) with multimodal large language models (e.g., GPT-5) in a unified framework to assess end-to-end performance. Even the strongest retriever achieves only 40.8% Recall@1, and while top-performing MLLMs attain 80.86% reasoning accuracy when given ground-truth evidence, their end-to-end accuracy drops sharply to 51.4%, underscoring retrieval as the primary bottleneck in current multimodal systems.
📝 Abstract
Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding in isolation. However, these settings ignore a critical real-world requirement: retrieving relevant evidence from large, heterogeneous multimodal corpora before reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates spanning documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique, validated evidence item within the retrieval pool, requiring evidence localization across modalities followed by fine-grained reasoning. We find that models perform competitively when given the corresponding evidence, but their performance drops sharply when they must retrieve it from the full corpus: even the strongest retriever, E5-V, achieves only 40.8% Recall@1, and state-of-the-art MLLMs such as GPT-5 fall from 80.86% reasoning accuracy with gold evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that exposes limitations obscured by small-scale evaluations and motivates retrieval-centric advances in multimodal systems.
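Since each question is grounded in exactly one validated evidence item, the Recall@1 and top-5 numbers above reduce to checking whether that gold item appears among the top-k retrieved candidates. A minimal sketch of this standard computation (function names and the candidate IDs are illustrative, not from the benchmark):

```python
def recall_at_k(ranked_ids, gold_id, k):
    # Per-question score: 1 if the single gold evidence item
    # appears among the top-k retrieved candidates, else 0.
    return int(gold_id in ranked_ids[:k])

def corpus_recall_at_k(rankings, gold_ids, k):
    # Corpus-level Recall@k: mean of the per-question scores.
    scores = [recall_at_k(r, g, k) for r, g in zip(rankings, gold_ids)]
    return sum(scores) / len(scores)

# Hypothetical example: 3 questions, each with a ranked list of
# mixed-modality candidate IDs (documents, images, videos).
rankings = [["img_7", "doc_2"], ["vid_1", "img_3"], ["doc_9", "vid_4"]]
gold_ids = ["doc_2", "vid_1", "img_5"]
print(corpus_recall_at_k(rankings, gold_ids, 1))  # only question 2 hits at rank 1
```

Under top-k retrieval, the same ranked lists truncated at k are what the MLLM receives as context, which is why end-to-end accuracy is bounded by retrieval quality.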