🤖 AI Summary
Current multimodal large language models (MLLMs) perform poorly on complex multi-page document understanding, largely because high-quality document-level multimodal datasets are scarce and mainstream retrieval-augmented generation (RAG) pipelines suffer from fragmented retrieval contexts, multi-stage error accumulation, and added retrieval latency. To address these challenges, we introduce Doc-750K, a high-quality document-level multimodal dataset designed for multi-page understanding, featuring diverse document structures, extensive cross-page dependencies, and question-answer pairs derived from the original documents. Building on this dataset, we propose Docopilot, a native multimodal model that handles document-level dependencies end to end without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency across diverse document understanding benchmarks and multi-turn interactions, setting a new baseline for document-level multimodal understanding.
📝 Abstract
Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. Current retrieval-augmented generation (RAG) methods offer partial solutions, but they suffer from fragmented retrieval contexts, multi-stage error accumulation, and the added latency of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. The dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on this dataset, we develop a native multimodal model, Docopilot, which accurately handles document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot.