DoPTA: Improving Document Layout Analysis using Patch-Text Alignment

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses error accumulation and computational overhead in document layout analysis caused by OCR dependency, proposing an OCR-free end-to-end multimodal approach. Methodologically, it introduces (1) the first document-oriented Patch-Text alignment pretraining paradigm, enabling fine-grained implicit cross-modal alignment between image patches and textual semantics via cross-attention; (2) integrated multimodal contrastive learning, masked image modeling, and self-supervised pretraining to enhance layout awareness; and (3) a pure vision encoder that jointly models visual structure and semantic content without language modeling components. Evaluated on D4LA and FUNSD, the method achieves state-of-the-art performance—outperforming larger-parameter models—while reducing pretraining computation significantly. Crucially, it enables zero-OCR downstream inference, eliminating OCR-induced errors and latency.

📝 Abstract
The advent of multimodal learning has brought a significant improvement in document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, works in this space are often focused on the textual aspect, using the visual space as auxiliary information. While some works have explored pure vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed to leverage the textual information in document images to improve performance on visual tasks. Our document encoder model, DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Improves document layout analysis using image-text alignment.
Eliminates the need for OCR during inference in document understanding.
Achieves state-of-the-art results on document visual analysis benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel image-text alignment technique
No OCR required during inference
Auxiliary reconstruction objective enhances performance
Nikitha S. R.
Media and Data Science Research Lab, Adobe
Tarun Ram Menta
Mausoom Sarkar
Adobe

Tags: Computer Vision, Multimodal Learning