A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

📅 2024-07-02
🏛️ arXiv.org
📈 Citations: 18
Influential: 4
🤖 AI Summary
Existing document understanding methods struggle to jointly model spatial layout and textual semantics, often suffering from long-sequence bottlenecks or undermining the autoregressive capabilities of large language models (LLMs). To address this, we propose a "one-bounding-box-one-token" layout encoding paradigm: each text box is mapped to a single layout token and interleaved with its word tokens in the input sequence, enabling fine-grained, autoregressive joint modeling of layout and text on an equal footing. The method operates natively within standard LLM architectures, using a token-level layout embedding projection and end-to-end fine-tuning, without modifying the Transformer structure. On key information extraction (KIE), it outperforms state-of-the-art multimodal LLMs by 27.2% and OCR-augmented LLMs by 15.1%; on document visual question answering (VQA), it achieves a 12.0% improvement. This advances both understanding accuracy and modeling efficiency in document intelligence.
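The core idea above can be illustrated with a toy sketch: a small linear projection maps each OCR bounding box (four normalized coordinates) to a single embedding, which is then interleaved with that box's word-token embeddings before being fed to the LLM. This is a minimal, hypothetical illustration with made-up dimensions and random weights, not the paper's actual implementation; names like `layout_token` and the sample OCR data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy embedding size; the real model would use the LLM's hidden size

# Hypothetical projection matrix: (x1, y1, x2, y2) -> one d_model-dim "layout token"
W = rng.normal(size=(4, d_model))

def layout_token(bbox):
    """Project one normalized bounding box to a single layout embedding."""
    return np.asarray(bbox) @ W  # shape: (d_model,)

# Toy OCR output: each entry is (bounding box, word-token embeddings for that box)
ocr = [
    ((0.1, 0.1, 0.3, 0.15), rng.normal(size=(2, d_model))),  # e.g. "Invoice No."
    ((0.5, 0.1, 0.7, 0.15), rng.normal(size=(1, d_model))),  # e.g. "12345"
]

# Interleave: one layout token, then that box's word tokens, for every box
sequence = np.concatenate(
    [np.vstack([layout_token(bbox)[None, :], words]) for bbox, words in ocr]
)
print(sequence.shape)  # one extra row per box, regardless of coordinate precision
```

Because each box contributes exactly one extra token, sequence length grows by the number of boxes rather than by the number of digits needed to spell out coordinates as text, which is the efficiency argument the summary makes.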

📝 Abstract
Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. In particular, LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues while leveraging autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in Key Information Extraction (KIE) and Visual Question Answering (VQA). Comprehensive benchmark evaluations reveal significant improvements, with a 27.2% increase on KIE tasks and 12.0% on VQA tasks compared to previous state-of-the-art document understanding MLLMs, as well as a 15.1% improvement over other SOTA OCR-based LLMs on KIE tasks.
Problem

Research questions and friction points this paper is trying to address.

Integrating spatial layouts with text in LLMs efficiently
Avoiding long sequence issues in document understanding
Enhancing performance in KIE and VQA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projects each bounding box to a single embedding ("one box, one token")
Interleaves layout tokens with word tokens, keeping sequences short
Preserves and leverages the autoregressive traits of standard LLMs