Document Intelligence in the Era of Large Language Models: A Survey

📅 2025-10-15

🤖 AI Summary

This survey examines three core challenges in Document AI (DAI) driven by large language models (LLMs): multimodal document understanding, multilingual adaptation, and retrieval-augmented generation (RAG). As a future direction, it proposes an “agent-driven document processing” paradigm that emphasizes task decomposition, dynamic tool invocation, and adaptive planning. It also advocates lightweight, document-centric foundation models, built on decoder-only architectures, to improve structured understanding and cross-format generalization. The authors argue that combining multimodal fusion, RAG, and agent-based reasoning can improve semantic parsing and controllable generation for complex, heterogeneous documents. The survey offers a structured technical taxonomy for DAI, together with a conceptual framework and design principles aimed at both academic research and industrial deployment.

📝 Abstract

Document AI (DAI) has emerged as a vital application area and has been significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advances in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research efforts and future prospects for LLMs in this field. We explore key advances and challenges in multimodal, multilingual, and retrieval-augmented DAI, and suggest future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state of the art in DAI and its implications for both academic research and practical applications.

Problem

Research questions and friction points this paper is trying to address.

Surveying how large language models have transformed Document AI

Exploring multimodal, multilingual, and retrieval-augmented challenges in document intelligence

Analyzing future directions such as agent-based approaches and document-specific foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only LLMs have displaced encoder-decoder architectures in Document AI

Multimodal, multilingual, and retrieval-augmented approaches advance DAI

Agent-based approaches and document-specific foundation models are proposed as future research directions
