GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Document understanding with multimodal large language models (MLLMs) lacks fine-grained, attributable evaluation benchmarks, which hinders precise identification of capability bottlenecks and systematic optimization. To address this, we propose GDI-Bench, the first fine-grained benchmark for general-purpose document intelligence, covering 9 real-world scenarios, 19 diverse tasks, and 1.9k document images. We introduce a decoupled evaluation paradigm that assesses visual perception and reasoning separately, enabling complexity-aware difficulty grading. Additionally, we design an intelligence-preserving training strategy to mitigate catastrophic forgetting during supervised fine-tuning (SFT). Using GDI-Bench, we find that state-of-the-art models such as GPT-4o are strong at reasoning but limited in visual perception. Our GDI Model achieves state-of-the-art performance on GDI-Bench and on previous benchmarks, demonstrating both diagnostic utility and modeling advancement.

📝 Abstract
The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 1.9k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate various open-source and closed-source models on GDI-Bench, conducting decoupled analyses in the visual and reasoning domains. For instance, the GPT-4o model excels in reasoning tasks but exhibits limitations in visual capabilities. To address the diverse tasks and domains in GDI-Bench, we propose a GDI Model that mitigates catastrophic forgetting during supervised fine-tuning (SFT) through an intelligence-preserving training strategy. Our model achieves state-of-the-art performance on previous benchmarks and on GDI-Bench. Both our benchmark and model will be open source.
Problem

Research questions and friction points this paper is trying to address.

Need for a comprehensive benchmark to evaluate multimodal document intelligence models
Existing benchmarks lack systematic model weakness identification and improvement guidance
Proposing GDI-Bench to decouple visual and reasoning complexities for graded assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples visual and reasoning complexity for assessment
Introduces intelligence-preserving training to prevent forgetting
Features 1.9k images across 9 scenarios and 19 tasks
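
The decoupled, graded assessment listed above can be illustrated with a minimal sketch. It assumes each benchmark item carries a visual-complexity level and a reasoning-complexity level, and it reports accuracy along each axis independently. The record layout, the integer level encoding, the function name `accuracy_by_level`, and the toy data are all hypothetical and are not taken from GDI-Bench itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (visual_level, reasoning_level, is_correct).
# The integer level encoding is an assumption made for this sketch only.
Record = Tuple[int, int, bool]


def accuracy_by_level(records: Iterable[Record]) -> Dict[str, Dict[int, float]]:
    """Return accuracy per difficulty level, computed separately for each axis."""
    # For each axis, map level -> [number correct, number of items].
    totals = {
        "visual": defaultdict(lambda: [0, 0]),
        "reasoning": defaultdict(lambda: [0, 0]),
    }
    for visual_level, reasoning_level, is_correct in records:
        for axis, level in (("visual", visual_level), ("reasoning", reasoning_level)):
            bucket = totals[axis][level]
            bucket[0] += int(is_correct)
            bucket[1] += 1
    return {
        axis: {level: hits / count for level, (hits, count) in sorted(levels.items())}
        for axis, levels in totals.items()
    }


# Made-up results for illustration, not measurements from the paper.
toy_records = [(0, 0, True), (0, 1, True), (1, 0, False), (1, 1, False)]
print(accuracy_by_level(toy_records))
```

In this toy data, accuracy falls as the visual level rises while staying flat across reasoning levels, which is the kind of pattern that would point to a visual rather than a reasoning bottleneck.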
Siqi Li
Shanghai Artificial Intelligence Laboratory, Zhejiang University
Yufan Shen
Zhejiang University
MLLM, GUI Agent
Xiangnan Chen
Shanghai Artificial Intelligence Laboratory, Zhejiang University
Jiayi Chen
School of Science and Engineering, The Chinese University of Hong Kong
Hengwei Ju
Shanghai Artificial Intelligence Laboratory, Fudan University
Haodong Duan
Shanghai AI Lab | CUHK | PKU
Computer Vision, Video Understanding, Multimodal Learning, Generative AI
Song Mao
Shanghai Artificial Intelligence Laboratory
Hongbin Zhou
Shanghai AI Laboratory
Bo Zhang
Shanghai Artificial Intelligence Laboratory
Pinlong Cai
Shanghai Artificial Intelligence Laboratory
Artificial Intelligence, Decision Intelligence, Knowledge Systems
Licheng Wen
Shanghai AI Laboratory
AI Agents, Autonomous Driving, Robotics
Botian Shi
Shanghai Artificial Intelligence Laboratory
VLMs, Document Understanding, Autonomous Driving
Yong Liu
Shanghai Artificial Intelligence Laboratory
Xinyu Cai
Shanghai Artificial Intelligence Laboratory
Artificial Intelligence, Autonomous Driving
Yu Qiao
Shanghai Artificial Intelligence Laboratory