🤖 AI Summary
Existing QA benchmarks are largely constrained to single-document, single-modal, and shallow-reasoning settings, failing to capture the cross-document, multimodal, and deep multi-hop reasoning required in realistic information retrieval. To address this, we propose DocHop-QA, the first large-scale multimodal, multi-document, multi-hop QA benchmark grounded in PubMed scientific literature, comprising 11,379 instances that integrate textual, tabular, and layout-aware information. Methodologically, we abandon hyperlink-based navigation in favor of an open reasoning paradigm driven by semantic similarity, and we introduce a layout-aware evidence fusion mechanism. The benchmark is constructed via an LLM-driven automated pipeline covering 11 scientific question categories, incorporating multimodal understanding, structured index prediction, and cross-document evidence synthesis. Evaluated on four discriminative and generative tasks, DocHop-QA proves highly challenging and broadly applicable, significantly advancing benchmarks for complex information retrieval and reasoning.
📄 Abstract
Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets have made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA on four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA's capacity to support complex, multimodal reasoning across multiple documents.
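The abstract contrasts hyperlink-based navigation with linking evidence across documents by semantic similarity. A minimal sketch of that general idea, under the assumption that passages and the question are represented as embedding vectors (the toy vectors and the `rank_evidence` helper below are illustrative, not the paper's actual pipeline):

```python
# Hypothetical sketch: instead of following explicit hyperlinks, candidate
# evidence passages from different documents are ranked by embedding
# similarity to the question. Toy 3-dim vectors stand in for real
# sentence-encoder embeddings.
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_evidence(query_vec, passages):
    """Rank (passage_id, vector) pairs by similarity to the query vector."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in passages]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy example: one question vector, passages drawn from three documents.
query = [1.0, 0.2, 0.0]
passages = [
    ("doc_A:para3", [0.9, 0.1, 0.1]),   # topically close to the question
    ("doc_B:table2", [0.0, 1.0, 0.0]),  # unrelated passage
    ("doc_C:para1", [0.8, 0.3, 0.0]),   # close as well: a second-hop candidate
]
print(rank_evidence(query, passages)[0][0])  # → doc_A:para3
```

In a multi-hop setting, the top-ranked passage from one document can itself be re-embedded and used as the query for the next hop, which is what lets reasoning cross document boundaries without any hyperlink graph.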