🤖 AI Summary
To address low layout parsing accuracy and poor cross-platform deployability in complex document analysis, this paper proposes five lightweight object detection models based on the RT-DETR, RT-DETRv2, and DFINE architectures, augmented with a customized post-processing strategy. Integrated into the Docling document conversion pipeline, these models improve mAP by 20.6–23.9% over Docling's previous baseline, with comparable or better runtime. The best-performing model, heron-101, attains 78% mAP at 28 ms/image inference latency on a single NVIDIA A100 GPU, balancing high accuracy and real-time performance. Runtime is measured on CPU, NVIDIA GPU, and Apple GPU platforms, and the evaluation covers accuracy, throughput, and latency. The complete set of models, code, and technical documentation is publicly released on Hugging Face, facilitating standardized, reproducible research in document intelligence.
📝 Abstract
This technical report documents the development of novel layout analysis models integrated into the Docling document-conversion pipeline. We trained several state-of-the-art object detectors based on the RT-DETR, RT-DETRv2, and DFINE architectures on a heterogeneous corpus of 150,000 documents, drawn from both openly available and proprietary sources. Post-processing steps were applied to the raw detections to make them better suited to the document conversion task. We evaluated the effectiveness of the layout analysis on various document benchmarks using different methodologies, and measured runtime performance across different environments (CPU, NVIDIA GPUs, and Apple GPUs). We introduce five new document layout models that achieve a 20.6% to 23.9% mAP improvement over Docling's previous baseline, with comparable or better runtime. Our best model, "heron-101", attains 78% mAP with an inference time of 28 ms/image on a single NVIDIA A100 GPU. Extensive quantitative and qualitative experiments establish best practices for training, evaluating, and deploying document-layout detectors, providing actionable guidance for the document conversion community. All trained checkpoints, code, and documentation are released under a permissive license on Hugging Face.
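The abstract does not describe the post-processing steps, so the following is only a generic illustration of what filtering raw layout detections can look like, not the method used in the report. It assumes a hypothetical `Detection` record, a confidence threshold (`min_score`), and an overlap threshold (`max_iou`), none of which come from the source. Detections below the threshold are dropped, and of any two heavily overlapping boxes only the higher-scoring one is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical raw detection: class label, confidence, and (x0, y0, x1, y1) box."""
    label: str
    score: float
    box: tuple


def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, min_score=0.5, max_iou=0.8):
    """Drop low-confidence detections, then greedily suppress heavy overlaps.

    Both thresholds are illustrative assumptions, not values from the report.
    """
    kept = []
    # Visit detections from most to least confident so the best box wins.
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if det.score < min_score:
            continue
        # Keep the detection only if it does not heavily overlap a kept one.
        if all(iou(det.box, other.box) <= max_iou for other in kept):
            kept.append(det)
    return kept
```

A production pipeline would likely add layout-specific rules on top of this (for example, class-dependent thresholds), but those details are outside what the abstract states.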