Advanced Layout Analysis Models for Docling

📅 2025-09-15
🤖 AI Summary
To improve layout-parsing accuracy and cross-platform deployability in complex document analysis, this paper introduces five object detection models based on the RT-DETR, RT-DETRv2, and DFINE architectures, combined with post-processing of the raw detections. Integrated into the Docling document-conversion pipeline, the models improve mAP by 20.6%–23.9% over Docling's previous baseline, with comparable or better runtime. The best-performing model, heron-101, attains 78% mAP at 28 ms/image inference time on a single NVIDIA A100 GPU. Accuracy and runtime are evaluated on several document benchmarks and across CPU, NVIDIA GPU, and Apple GPU environments. All trained checkpoints, code, and documentation are released on Hugging Face under a permissive license.

📝 Abstract
This technical report documents the development of novel Layout Analysis models integrated into the Docling document-conversion pipeline. We trained several state-of-the-art object detectors based on the RT-DETR, RT-DETRv2 and DFINE architectures on a heterogeneous corpus of 150,000 documents (both openly available and proprietary). Post-processing steps were applied to the raw detections to make them more applicable to the document conversion task. We evaluated the effectiveness of the layout analysis on various document benchmarks using different methodologies while also measuring the runtime performance across different environments (CPU, NVIDIA and Apple GPUs). We introduce five new document layout models achieving 20.6%–23.9% mAP improvement over Docling's previous baseline, with comparable or better runtime. Our best model, "heron-101", attains 78% mAP with 28 ms/image inference time on a single NVIDIA A100 GPU. Extensive quantitative and qualitative experiments establish best practices for training, evaluating, and deploying document-layout detectors, providing actionable guidance for the document conversion community. All trained checkpoints, code, and documentation are released under a permissive license on Hugging Face.
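The mAP figures quoted above are built from per-class average precision (AP). As a rough illustration only, the sketch below computes AP for a single class at a single IoU threshold using greedy matching. The evaluation protocol actually used in the report is not described in this summary, so the threshold, the matching rule, and the names `iou` and `average_precision` are assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: List[Tuple[float, Box]],  # (confidence score, box)
    ground_truth: List[Box],
    iou_thr: float = 0.5,  # assumed threshold, not taken from the report
) -> float:
    """Single-class AP: area under the monotone precision-recall envelope."""
    if not ground_truth:
        return 0.0
    matched = [False] * len(ground_truth)
    hits = []
    # Visit detections from most to least confident.
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if not matched[j] and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched[best_j] = True  # each ground-truth box is matched at most once
        hits.append(best_j >= 0)

    precisions, recalls, true_pos = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_pos += hit
        precisions.append(true_pos / rank)
        recalls.append(true_pos / len(ground_truth))
    # Make precision non-increasing from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

mAP would then be the mean of this value over all layout classes, and benchmark suites often additionally average over several IoU thresholds.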
Problem

Research questions and friction points this paper is trying to address.

Develops advanced layout analysis models for document conversion
Trains object detectors on 150,000 documents to improve accuracy
Evaluates performance across different hardware environments and benchmarks
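A per-image latency figure such as the "28 ms/image" above can be collected with a simple timing loop. The following is a minimal sketch, not the report's benchmarking harness: `measure_latency_ms`, the `predict` callable, and the warm-up count are hypothetical names and values chosen for illustration.

```python
import statistics
import time
from typing import Any, Callable, Dict, Sequence


def measure_latency_ms(
    predict: Callable[[Any], Any],  # hypothetical: runs the detector on one image
    images: Sequence[Any],
    warmup: int = 3,                # assumed number of untimed warm-up calls
) -> Dict[str, float]:
    """Time `predict` on each image and summarise the results in milliseconds."""
    if not images:
        raise ValueError("need at least one image to time")
    # Warm-up calls absorb one-off costs such as lazy initialisation.
    for image in images[:warmup]:
        predict(image)

    timings = []
    for image in images:
        start = time.perf_counter()
        predict(image)
        # On a GPU, `predict` must block until the device has finished
        # (e.g. by synchronising inside it), or the clock stops too early.
        timings.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.fmean(timings)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(timings),
        "images_per_s": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }
```

Reporting the median alongside the mean makes occasional slow outliers visible, which matters when comparing CPU and GPU environments.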
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trained RT-DETR, RT-DETRv2, and DFINE object detectors
Applied post-processing to raw detections
Introduced five new document layout models
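This summary does not say which post-processing steps are applied to the raw detections. One plausible kind of step, shown here purely as an assumption, is to discard low-confidence boxes and then drop any box that is mostly covered by a higher-scoring one. The `Detection` type, the `postprocess` function, and both thresholds are hypothetical and are not taken from the report or from Docling's code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Detection:
    label: str
    score: float
    box: Box


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (inter_w * inter_h) / smaller if smaller > 0 else 0.0


def postprocess(
    detections: List[Detection],
    min_score: float = 0.5,    # assumed confidence threshold
    max_overlap: float = 0.8,  # assumed overlap limit
) -> List[Detection]:
    """Keep confident detections, preferring higher scores when boxes overlap."""
    kept: List[Detection] = []
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if det.score < min_score:
            continue
        # Keep the box only if no already-kept box covers most of it.
        if all(overlap_ratio(det.box, k.box) <= max_overlap for k in kept):
            kept.append(det)
    return kept


raw = [
    Detection("text", 0.9, (0, 0, 100, 50)),
    Detection("text", 0.6, (5, 5, 95, 45)),          # nested inside the first box
    Detection("table", 0.8, (0, 60, 100, 120)),
    Detection("picture", 0.3, (200, 200, 220, 220)),  # below the score threshold
]
cleaned = postprocess(raw)
```

Measuring overlap against the smaller box, rather than using IoU, catches a small box nested inside a large one, a case where IoU stays low even though the two detections are redundant.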
Authors
Nikolaos Livathinos, IBM Research
Christoph Auer, IBM Research
Ahmed Nassar, IBM Research, Rüschlikon, Switzerland
Rafael Teixeira de Lima, IBM Research, Rüschlikon, Switzerland
Maksym Lysak, IBM
Brown Ebouky, IBM Research, Rüschlikon, Switzerland
Cesar Berrospi, Senior Research Scientist, IBM Research
Michele Dolfi, IBM Research
Panagiotis Vagenas, IBM Research, Rüschlikon, Switzerland
Matteo Omenetti, IBM Research, Rüschlikon, Switzerland
Kasper Dinkla, IBM Research
Yusik Kim, IBM Research, Rüschlikon, Switzerland
Valery Weber, IBM Research, Rüschlikon, Switzerland
Lucas Morin, IBM Research, Rüschlikon, Switzerland
Ingmar Meijer, IBM Research, Rüschlikon, Switzerland
Viktor Kuropiatnyk, IBM Research, Rüschlikon, Switzerland
Tim Strohmeyer, IBM Research, Rüschlikon, Switzerland
A. Said Gurbuz, IBM Research, Rüschlikon, Switzerland
Peter W. J. Staar, IBM Research, Rüschlikon, Switzerland