BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

📅 2026-03-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing question-answering benchmarks primarily score final-answer correctness and rarely evaluate intermediate reasoning, particularly the integration of evidence across text, tables, and figures in long multimodal documents. To address this gap, the work introduces BRIDGE, a multi-hop reasoning benchmark built on long scientific papers. It provides explicit annotations of multi-hop reasoning paths, supports both chain-like and fan-out reasoning structures, and enables step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation.

📝 Abstract

Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that require integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.
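
To make the difference between answer-only and step-level evaluation concrete, the sketch below models an annotated reasoning path as a small dependency graph of hops and scores how much of the gold evidence a system actually cited. It is an illustration only: the `Hop` schema, the chain versus fan-out rule, the `evidence_recall` metric, and all identifiers such as `sec3-p2` are assumptions made for this example, not BRIDGE's actual annotation format or scoring method.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One annotated reasoning step (hypothetical schema, not BRIDGE's)."""
    hop_id: str
    modality: str           # "text", "table", or "figure"
    evidence_id: str        # made-up identifier of the supporting evidence
    depends_on: tuple = ()  # hop_ids this step builds on


def path_structure(hops):
    """Label a path 'chain' or 'fan-out' under an assumed rule:
    a chain has one root, and every hop has at most one parent
    and at most one child; anything else counts as fan-out."""
    one_root = sum(1 for h in hops if not h.depends_on) == 1
    parents_ok = all(len(h.depends_on) <= 1 for h in hops)
    child_counts = Counter(p for h in hops for p in h.depends_on)
    children_ok = all(n <= 1 for n in child_counts.values())
    return "chain" if one_root and parents_ok and children_ok else "fan-out"


def evidence_recall(gold_hops, cited_evidence):
    """Fraction of gold hops whose evidence the system cited
    (an assumed stand-in for a step-level metric)."""
    cited = set(cited_evidence)
    hits = sum(1 for h in gold_hops if h.evidence_id in cited)
    return hits / len(gold_hops)


# Chain: text -> table -> figure, each hop building on the previous one.
chain = [
    Hop("h1", "text", "sec3-p2"),
    Hop("h2", "table", "table-2", ("h1",)),
    Hop("h3", "figure", "fig-4", ("h2",)),
]

# Fan-out: two independent pieces of evidence aggregated by a final hop.
fan_out = [
    Hop("a", "text", "sec2-p1"),
    Hop("b", "table", "table-1"),
    Hop("c", "figure", "fig-2", ("a", "b")),
]

# A system can give the right final answer while missing the figure hop.
recall = evidence_recall(chain, ["sec3-p2", "table-2"])
```

Here an answer-only metric would count the prediction as fully correct, while the step-level view shows that one of the three annotated hops (the figure) was never grounded. That gap is the kind of failure the abstract describes as hidden under conventional evaluation.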

Problem

Research questions and friction points this paper is trying to address.

multi-hop reasoning

long multimodal documents

scientific papers

evidence aggregation

intermediate reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning

multimodal documents

reasoning annotation