MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

📅 2026-01-18
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Existing benchmarks struggle to evaluate models' ability to use evidence in end-to-end multimodal deep research. To address this gap, this work introduces an evaluation benchmark of 140 expert-designed tasks, each presenting an image-text pair and requiring the model to generate a research report that is grounded in explicit evidence and consistent across modalities. The authors propose the first comprehensive evaluation framework tailored to report-style multimodal deep research, with three fine-grained assessment modules that enable diagnostic analysis: FLAE (report quality), TRACE (citation alignment), and MOSAIC (multimodal completeness). Experiments across 25 state-of-the-art models reveal systematic trade-offs among generation quality, citation fidelity, and multimodal grounding, and identify multimodal completeness as a critical bottleneck.
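To make the task format concrete, here is a minimal, hypothetical schema for an MMDR-Bench-style task and the report a model returns. The benchmark's actual data format is not reproduced in this summary, so every class and field name below is an illustrative assumption, not the released schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema; MMDR-Bench's released format is not shown in
# this summary, so all names here are assumptions for illustration.

@dataclass
class Task:
    task_id: str
    domain: str          # one of the 21 benchmark domains
    question: str        # expert-crafted research prompt
    image_path: str      # visual half of the image-text bundle

@dataclass
class Citation:
    claim: str           # report sentence making a sourced claim
    source_url: str      # retrieved evidence backing the claim

@dataclass
class Report:
    task_id: str
    narrative: str       # long-form report text
    citations: list[Citation] = field(default_factory=list)
    visual_refs: list[str] = field(default_factory=list)  # figures/regions the text refers to
```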

📝 Abstract
Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use: models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs among generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
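Since the pipeline's value lies in its per-module breakdown, a small sketch may help. Assuming each module emits named sub-scores in [0, 1] (an assumption; the paper's actual formulas and signal names are not given here), evaluation can keep the FLAE, TRACE, and MOSAIC profiles separate rather than collapsing them into one number, which is what makes trade-offs such as strong prose with weak citation discipline visible.

```python
# A minimal sketch, assuming each module returns a dict of named
# sub-scores in [0, 1]. FLAE/TRACE/MOSAIC internals are not
# reproduced here; these stubs only illustrate how fine-grained
# signals support diagnosis beyond a single overall score.

def flae(report) -> dict[str, float]:
    """Formula-LLM Adaptive Evaluation: report-quality signals (stub)."""
    return {"coverage": 0.0, "coherence": 0.0, "depth": 0.0}

def trace(report) -> dict[str, float]:
    """Trustworthy Retrieval-Aligned Citation Evaluation: citation signals (stub)."""
    return {"claim_support": 0.0, "citation_precision": 0.0}

def mosaic(report) -> dict[str, float]:
    """Multimodal Support-Aligned Integrity Check: text-visual signals (stub)."""
    return {"visual_grounding": 0.0, "cross_modal_consistency": 0.0}

def evaluate(report) -> dict[str, dict[str, float]]:
    # Keep the per-module breakdown rather than averaging into one
    # score, so cross-module trade-offs remain diagnosable.
    return {"FLAE": flae(report), "TRACE": trace(report), "MOSAIC": mosaic(report)}
```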
Problem

Research questions and friction points this paper addresses.

Multimodal Deep Research
Citation-grounded Report Generation
Evidence Use
Multimodal Integrity
Benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Deep Research Agents
MMDeepResearch-Bench
Citation-grounded Report Generation
Multimodal Evidence Alignment
Interpretable Evaluation Pipeline