🤖 AI Summary
Large language models (LLMs) struggle to reason jointly over the layout, structure, and semantics of visually rich documents (e.g., documents containing charts and tables), which limits their effectiveness in visually rich document understanding (VrDU) tasks such as question answering (VrDU-QA). Method: This paper presents a systematic survey of LLM-enhanced VrDU, proposing the first taxonomy of LLM-driven VrDU-QA methods and distilling three core challenges: layout awareness, structural modeling, and cross-modal alignment. We introduce a unified framework that integrates OCR, layout analysis, vision encoders, and LLMs, augmented by instruction tuning, visual prompt learning, and multi-stage alignment strategies. Contribution/Results: Based on a synthesis of 120+ works, we establish the first comprehensive evaluation framework for VrDU and uncover critical performance bottlenecks. We further identify key future directions, including lightweight deployment, interpretable inference, and few-shot generalization, to advance robust, efficient, and explainable VrDU systems.
📝 Abstract
Using Large Language Models (LLMs) for Visually-rich Document Understanding (VrDU) has significantly improved performance on tasks requiring both comprehension and generation, such as question answering, although it also introduces new challenges. This survey explains how LLM-enhanced VrDU models work, covering methods for integrating visually rich document (VrD) features into LLMs and highlighting key challenges.