Document Intelligence in the Era of Large Language Models: A Survey

📅 2025-10-15
🤖 AI Summary
This survey examines three core challenges in Document AI (DAI) driven by large language models (LLMs): multimodal document understanding, multilingual adaptation, and retrieval-augmented generation (RAG). It traces the field's shift from encoder-decoder architectures to decoder-only LLMs and reviews current research in each of these areas. As future directions, it points to agent-based document processing, which emphasizes task decomposition, dynamic tool invocation, and adaptive planning, and to lightweight, document-specific foundation models built on decoder-only architectures, intended to improve structured understanding and cross-format generalization. The paper aims to provide a structured analysis of the state of the art in DAI and its implications for both academic research and practical applications.

📝 Abstract
Document AI (DAI) has emerged as a vital application area and has been significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research efforts and the future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, and suggest future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state of the art in DAI and its implications for both academic and practical applications.
Problem

Research questions and friction points this paper is trying to address.

Surveying the evolution of Document AI as transformed by large language models
Exploring multimodal, multilingual, and retrieval-augmented challenges in document intelligence
Analyzing future directions such as agent-based approaches and document-specific foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

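Decoder-only LLMs replace encoder-decoder architectures as the dominant approach in Document AI
Multimodal, multilingual, and retrieval-augmented approaches advance DAI
Agent-based approaches and document-specific foundation models are suggested as future research directions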