Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends

📅 2025-01-04

🤖 AI Summary

Problem: Large language models (LLMs) struggle to reason jointly over the layout, structural, and semantic information in visually rich documents (e.g., documents containing charts and tables), which limits their effectiveness in question answering over such documents (VrDU-QA). Method: This paper presents a systematic survey of LLM-enhanced visually rich document understanding (VrDU). It proposes the first taxonomy of LLM-driven VrDU-QA methods and distills three core challenges: layout awareness, structural modeling, and cross-modal alignment. It describes a unified framework that integrates OCR, layout analysis, vision encoders, and LLMs, augmented by instruction tuning, visual prompt learning, and multi-stage alignment strategies. Contribution/Results: Based on a synthesis of 120+ works, the survey establishes the first comprehensive evaluation framework for VrDU and uncovers critical performance bottlenecks. It also identifies key future directions, including lightweight deployment, interpretable inference, and few-shot generalization, to advance robust, efficient, and explainable VrDU systems.

📝 Abstract

Using Large Language Models (LLMs) for Visually-rich Document Understanding (VrDU) has significantly improved performance on tasks that require both comprehension and generation, such as question answering, but it also introduces new challenges. This survey explains how LLM-enhanced VrDU models work, covering methods for integrating visually rich document (VrD) features into LLMs and highlighting key challenges.
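
As a rough illustration of what integrating VrD features into an LLM can look like, one simple text-only approach is to serialize OCR output so that whitespace approximates the page layout, then place that text in the prompt. The sketch below is an assumption for illustration, not a method taken from the survey: the `OcrWord` type, the `serialize_layout` and `build_prompt` helpers, and the `line_tol` and `units_per_char` parameters are all hypothetical names, and a real system would obtain the words and coordinates from an OCR engine.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR token with the top-left corner of its bounding box (hypothetical type)."""
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, in page units


def serialize_layout(words, line_tol=5.0, units_per_char=10.0):
    """Render OCR words as plain text whose spacing approximates the page layout."""
    # Group words into lines: a word joins the current line when its top edge
    # is within line_tol of the first word on that line.
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(word.y - lines[-1][0].y) <= line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])

    rendered = []
    for line in lines:
        out = ""
        for word in sorted(line, key=lambda w: w.x):
            # Map the horizontal position to a character column and pad up to it,
            # keeping at least one space between neighbouring words.
            col = int(word.x // units_per_char)
            pad = max(col - len(out), 1 if out else 0)
            out += " " * pad + word.text
        rendered.append(out)
    return "\n".join(rendered)


def build_prompt(words, question):
    """Combine the layout-preserving text with a question for a text-only LLM."""
    return f"Document:\n{serialize_layout(words)}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    page = [
        OcrWord("Invoice", 0, 0),
        OcrWord("Total", 0, 100),
        OcrWord("42", 200, 101),
    ]
    print(build_prompt(page, "What is the total?"))
```

This keeps only coarse spatial cues (reading order and column alignment). Approaches that also feed vision-encoder features or explicit coordinates to the model are outside the scope of this sketch.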

Problem

Research questions and friction points this paper is trying to address.

Information Retrieval

Large Language Models

Image-rich Documents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

Visual Information Integration

Enhanced Text Comprehension
