MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

📅 2026-04-14

🤖 AI Summary
This work addresses multi-hop question answering by AI agents in real-world, noisy web environments where queries lack explicit modality cues. It introduces MERRIN, a human-annotated benchmark that incorporates underexplored modalities such as video and audio. MERRIN evaluates agents’ ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning amidst heterogeneous and conflicting search results. The study benchmarks ten state-of-the-art models, including GPT-5.4-mini, Gemini 3/3.1 Flash/Pro, and the Qwen3 series, under three settings: no search, native search, and agentic search. Even the best-performing agent reaches only 40.1% accuracy, and the average across all agents is 22.3%. Agents score below humans while consuming more resources, which highlights significant gaps in current systems’ robustness for cross-modal retrieval and reasoning.

📝 Abstract
Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents' ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.
Problem

Research questions and friction points this paper is trying to address.

multimodal evidence retrieval
noisy web environments
multi-hop reasoning
search-augmented agents
modality selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal retrieval
noisy web environments
multi-hop reasoning
search-augmented agents
evidence reasoning