Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current OCR systems for complex documents typically yield only plain text, with no awareness of structure or semantics. To address this, the authors propose Éclair, a general-purpose, end-to-end text-extraction model that jointly performs text recognition in reading order, visual layout localization (via bounding boxes), and fine-grained semantic classification of blocks such as formulas, tables, footnotes, and figure captions. Key contributions are: (1) a unified formulation that integrates reading order, layout, and semantics in a single model; (2) a diverse, human-annotated benchmark for document-level OCR and semantic classification; and (3) state-of-the-art accuracy on this benchmark, together with strong results on established public benchmarks, demonstrating the model's versatility.

📝 Abstract
Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure -- including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages -- as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.
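To make the abstract's output description concrete, here is a minimal sketch of how a per-page result with text in reading order, bounding boxes, and semantic classes could be represented downstream. This is not Éclair's actual API or schema; the class, field names, label strings, and example values are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class DocElement:
    # Semantic class of the block, e.g. "paragraph", "table", "formula",
    # "footnote", "figure-caption" (hypothetical label set).
    semantic_class: str
    # Bounding box in page pixel coordinates: (x0, y0, x1, y1).
    bbox: tuple[float, float, float, float]
    # Extracted, formatted text content of the block.
    text: str


# Elements listed in predicted reading order (made-up example page).
page = [
    DocElement("section-header", (50, 40, 550, 70), "1 Introduction"),
    DocElement("paragraph", (50, 80, 550, 300), "OCR systems extract text..."),
    DocElement("footnote", (50, 720, 550, 750), "1. See appendix for details."),
]

# Because reading order is the list order, downstream consumers (e.g. LLM
# data curation) can recover the document text by simple concatenation,
# optionally filtering out classes such as footnotes first.
full_text = "\n".join(el.text for el in page)
```

A representation like this is what makes the paper's downstream use cases straightforward: retrieval and question answering can filter by `semantic_class`, while training-data pipelines can keep only body text in reading order.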
Problem

Research questions and friction points this paper is trying to address.

Extracts text with layout and reading order
Handles complex document structures and semantics
Improves OCR for downstream AI model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated reading order extraction
Semantic class detection
State-of-the-art OCR accuracy