🤖 AI Summary
Existing multimodal large language models (MLLMs) face critical bottlenecks when tackling text-dense, layout-complex document understanding: excessive computational overhead or insufficient vision–language fusion. To address this, we propose DocLayLLM, a lightweight MLLM with three key innovations: (1) a layout-aware visual patch encoding scheme integrating 2D positional embeddings; (2) native reuse of the LLM’s tokenization and semantic encoding capabilities for OCR-derived text, eliminating the need for a dedicated document encoder; and (3) the first chain-of-thought (CoT) pretraining framework coupled with CoT annealing, deeply embedding structured reasoning into multimodal modeling. Evaluated on multiple text-intensive document understanding benchmarks, DocLayLLM achieves state-of-the-art performance with significantly lower training cost than both OCR-dependent and OCR-free approaches. The code and models are publicly released.
📝 Abstract
Text-rich document understanding (TDU) requires comprehensive analysis of documents containing substantial textual content and complex layouts. While Multimodal Large Language Models (MLLMs) have made rapid progress in this domain, existing approaches either demand significant computational resources or struggle with effective multimodal integration. In this paper, we introduce DocLayLLM, an efficient multimodal extension of LLMs specifically designed for TDU. By lightly integrating visual patch tokens and 2D positional tokens into the LLM's input and encoding the document content with the LLM itself, we take full advantage of the document comprehension capability of LLMs and enhance their perception of OCR information. We have also examined the role of chain-of-thought (CoT) in depth and propose two novel techniques: CoT Pre-training and CoT Annealing. DocLayLLM achieves remarkable performance under lightweight training settings, demonstrating both its efficiency and effectiveness. Experimental results show that DocLayLLM outperforms existing OCR-dependent methods as well as OCR-free competitors. Code and models are available at https://github.com/whlscut/DocLayLLM.
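As a rough illustration of the input construction the abstract describes, the sketch below shows how visual patch tokens might be augmented with 2D positional embeddings and prepended to OCR-text token embeddings, so that a single LLM encodes the whole sequence. All names, shapes, and the random initialization are illustrative assumptions for this sketch, not the released DocLayLLM code.

```python
# Illustrative sketch (assumed names/shapes, not the paper's actual API):
# visual patch tokens get learned 2D positional embeddings encoding each
# patch's (row, col) on the page grid, then are prepended to the OCR-text
# token embeddings so the LLM itself encodes the document content.
import numpy as np

rng = np.random.default_rng(0)
D = 16      # toy embedding dimension
GRID = 4    # 4x4 grid of visual patches per page

# Learned positional tables (randomly initialized here for illustration)
row_pos = rng.normal(size=(GRID, D))   # 2D positional embeddings: rows
col_pos = rng.normal(size=(GRID, D))   # 2D positional embeddings: cols

def layout_aware_patches(patch_feats):
    """Add 2D positional embeddings to a (GRID*GRID, D) patch sequence."""
    out = patch_feats.copy()
    for i in range(GRID):
        for j in range(GRID):
            out[i * GRID + j] += row_pos[i] + col_pos[j]
    return out

def build_llm_input(patch_feats, text_embeds):
    """Prepend layout-aware visual tokens to OCR-text token embeddings."""
    return np.concatenate([layout_aware_patches(patch_feats), text_embeds], axis=0)

patches = rng.normal(size=(GRID * GRID, D))  # e.g. from a light visual encoder
text = rng.normal(size=(10, D))              # OCR text, embedded by the LLM itself
seq = build_llm_input(patches, text)
print(seq.shape)  # (26, 16): 16 visual tokens followed by 10 text tokens
```

The design point this sketch mirrors is that only the lightweight positional tables and patch projection are new; the text path reuses the LLM's own tokenizer and embeddings, which is what keeps the multimodal extension cheap to train.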