🤖 AI Summary
Current multimodal large language models (MLLMs) perform poorly on complex multi-page document understanding, largely because high-quality document-level multimodal datasets are scarce and mainstream retrieval-augmented generation (RAG) pipelines suffer from fragmented retrieval contexts, multi-stage error accumulation, and added retrieval latency. To address these challenges, we introduce Doc-750K, a high-quality document-level multimodal dataset designed for multi-page understanding, featuring diverse document structures, extensive cross-page dependencies, and question-answer pairs derived from the original documents. Building on this dataset, we propose Docopilot, a native multimodal model that handles document-level dependencies end to end without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency across diverse document understanding benchmarks and multi-turn interactions, setting a new baseline for document-level multimodal understanding.
📝 Abstract
Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. Current retrieval-augmented generation (RAG) methods offer partial solutions, but they suffer from fragmented retrieval contexts, multi-stage error accumulation, and the added latency of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. The dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on this dataset, we develop a native multimodal model, Docopilot, which accurately handles document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot.