🤖 AI Summary
Existing QA benchmarks are largely constrained to single-document, single-modal, and shallow-reasoning settings, failing to capture the cross-document, multimodal, and deep multi-hop reasoning required in realistic information retrieval. To address this, we propose DocHop-QA, the first large-scale multimodal, multi-document, multi-hop QA benchmark grounded in PubMed scientific literature, comprising 11,379 instances that integrate textual, tabular, and layout-aware information. Methodologically, we abandon hyperlink-based navigation in favor of an open reasoning paradigm driven by semantic similarity, and we introduce a layout-aware evidence fusion mechanism. The benchmark is constructed via an LLM-driven automated pipeline covering 11 scientific question categories, incorporating multimodal understanding, structured index prediction, and cross-document evidence synthesis. Evaluated on four discriminative and generative tasks, DocHop-QA proves highly challenging and broadly applicable, significantly advancing benchmarks for complex information retrieval and reasoning.
📄 Abstract
Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets have made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA on four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA's capacity to support complex, multimodal reasoning across multiple documents.
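The abstract contrasts hyperlink-based navigation with linking evidence across documents by semantic similarity. A minimal sketch of that general idea, under the assumption that passages and the question are represented as embedding vectors (the toy vectors and the `rank_evidence` helper below are illustrative, not the paper's actual pipeline):

```python
# Hypothetical sketch: instead of following explicit hyperlinks, candidate
# evidence passages from different documents are ranked by embedding
# similarity to the question. Toy 3-dim vectors stand in for real
# sentence-encoder embeddings.
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_evidence(query_vec, passages):
    """Rank (passage_id, vector) pairs by similarity to the query vector."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in passages]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy example: one question vector, passages drawn from three documents.
query = [1.0, 0.2, 0.0]
passages = [
    ("doc_A:para3", [0.9, 0.1, 0.1]),   # topically close to the question
    ("doc_B:table2", [0.0, 1.0, 0.0]),  # unrelated passage
    ("doc_C:para1", [0.8, 0.3, 0.0]),   # close as well: a second-hop candidate
]
print(rank_evidence(query, passages)[0][0])  # → doc_A:para3
```

In a multi-hop setting, the top-ranked passage from one document can itself be re-embedded and used as the query for the next hop, which is what lets reasoning cross document boundaries without any hyperlink graph.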