DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding

📅 2026-05-09

🤖 AI Summary
This study investigates whether multimodal large language models can reason in a verifiable and trustworthy way, beyond mere answer accuracy, when processing lengthy, visually rich documents. To this end, the authors formulate long-document question answering as a structured reasoning trajectory prediction task in which the model outputs evidence pages, supporting regions, relevant facts, and a final answer. They introduce a hierarchical, verifiable reasoning evaluation framework for long-document understanding, featuring a four-stage decoupled assessment (page localization, region grounding, fact extraction, and answer verification) with judges selected and calibrated through human alignment studies. The benchmark contains 1,124 questions derived from 273 documents and is used to evaluate 18 general-purpose models (6 proprietary, 12 open-weight) plus several domain-specific systems. Even among correct answers, the highest observed rate of complete evidence chains is only 29%; region grounding is the weakest stage across all models, and cross-architecture comparisons suggest that activated parameter count matters more than total model size.
📝 Abstract
Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol (Page Localization, Region Grounding, Fact Extraction, and Answer Verification) that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.
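To make the gap between answer accuracy and trajectory-level evaluation concrete, the following is a minimal sketch of how a "complete evidence chain rate among correct answers" could be computed. It assumes each of the four stages reduces to a pass/fail verdict per question, which the abstract does not specify; the class `StageVerdicts`, its field names, and both functions are hypothetical and are not taken from the DocScope code release.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageVerdicts:
    """Hypothetical pass/fail verdicts for one question (assumed binary)."""
    page_localization: bool
    region_grounding: bool
    fact_extraction: bool
    answer_verification: bool

    def chain_complete(self) -> bool:
        # A chain counts as complete only if every stage passes.
        return all((
            self.page_localization,
            self.region_grounding,
            self.fact_extraction,
            self.answer_verification,
        ))


def answer_accuracy(verdicts: List[StageVerdicts]) -> Optional[float]:
    """Fraction of questions whose final answer was judged correct."""
    if not verdicts:
        return None
    return sum(v.answer_verification for v in verdicts) / len(verdicts)


def complete_chain_rate_among_correct(
    verdicts: List[StageVerdicts],
) -> Optional[float]:
    """Among correctly answered questions, the fraction with a complete chain."""
    correct = [v for v in verdicts if v.answer_verification]
    if not correct:
        return None
    return sum(v.chain_complete() for v in correct) / len(correct)


# Toy data, invented purely for illustration (not benchmark results).
toy = [
    StageVerdicts(True, True, True, True),    # correct answer, complete chain
    StageVerdicts(True, False, True, True),   # correct answer, region stage failed
    StageVerdicts(True, False, False, True),  # correct answer, two stages failed
    StageVerdicts(True, True, True, False),   # evidence found, answer wrong
]

accuracy = answer_accuracy(toy)
chain_rate = complete_chain_rate_among_correct(toy)
print(f"answer accuracy: {accuracy:.2f}")
print(f"complete chains among correct answers: {chain_rate:.2f}")
```

In this toy set, three of four answers are correct but only one of those three is backed by a fully verified chain, which is the kind of divergence the abstract reports when it says answer accuracy cannot substitute for trajectory-level evaluation.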
Problem

Research questions and friction points this paper is trying to address.

verifiable reasoning
long-document understanding
multimodal large language models
trustworthy AI
evidence grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

verifiable reasoning
long-document understanding
structured reasoning trajectory
multimodal LLM evaluation
evidence grounding
Authors
Xiang Feng
ShanghaiTech University
Neural Radiance Fields, Image Super Resolution, Computer Vision
Jiawei Zhou
School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China
Zhangfeng Huang
Alibaba Group, Hangzhou, China
Kewei Wang
Alibaba Cloud
MLLM
Shanshan Ye
Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates
Jinxin Hu
Alibaba
Zulong Chen
Director, Alibaba Group
Machine Learning, Large Language Model, Search & Recommendation, NLP
Yong Luo
Wuhan University
Artificial Intelligence, Machine Learning, Data Mining, Pattern Classification and Search
Jing Zhang
School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China