🤖 AI Summary
Extracting structured information from 2D engineering drawings remains challenging: conventional OCR is brittle on complex layouts and overlapping symbols, yielding unstructured outputs with high error rates.
Method: This paper proposes a hybrid framework integrating oriented bounding box (OBB) detection with the Donut document understanding Transformer. We introduce a novel single-model, cross-category joint fine-tuning strategy to mitigate hallucination and enhance generalization. OBB detection is implemented via YOLOv11, trained on a custom nine-class annotated dataset; structured JSON generation is incorporated into the fine-tuning pipeline.
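As a rough illustration (not the paper's code), the detection stage produces oriented bounding boxes, which are commonly parameterized as (center, width, height, angle); recovering the four corners for cropping is a small rotation computation. The function name and the exact parameterization convention here are assumptions for the sketch:

```python
import math

def obb_corners(cx, cy, w, h, angle_rad):
    """Return the four corner points of an oriented bounding box.

    The (cx, cy, w, h, angle) parameterization matches what OBB
    detection heads (e.g. YOLO's OBB variants) commonly emit; the
    exact convention used in the paper is an assumption here.
    """
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)):
        # Rotate each half-extent offset about the box center.
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners

# Sanity check: zero rotation reproduces the plain axis-aligned box.
print(obb_corners(50, 50, 20, 10, 0.0))
# → [(40.0, 45.0), (60.0, 45.0), (60.0, 55.0), (40.0, 55.0)]
```

Each corner polygon would then be used to crop the drawing region that is fed to the Donut parser.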
Results: The framework achieves 94.77% precision in geometric dimensioning and tolerancing (GD&T) recognition, 100% recall for most classes, and an overall F1-score of 97.3%, while reducing the hallucination rate to 5.23%. It substantially decreases manual annotation effort and supports industrial-scale deployment.
📝 Abstract
Accurate extraction of key information from 2D engineering drawings is crucial for high-precision manufacturing. Manual extraction is time-consuming and error-prone, while traditional Optical Character Recognition (OCR) techniques often struggle with complex layouts and overlapping symbols, resulting in unstructured outputs. To address these challenges, this paper proposes a novel hybrid deep learning framework for structured information extraction by integrating an oriented bounding box (OBB) detection model with a transformer-based document parsing model (Donut). An in-house annotated dataset is used to train YOLOv11 for detecting nine key categories: Geometric Dimensioning and Tolerancing (GD&T), General Tolerances, Measures, Materials, Notes, Radii, Surface Roughness, Threads, and Title Blocks. Detected OBBs are cropped into images and labeled to fine-tune Donut for structured JSON output. Two fine-tuning strategies are compared: a single model trained jointly across all categories and separate category-specific models. Results show that the single model consistently outperforms the category-specific ones across all evaluation metrics, achieving higher precision (94.77% for GD&T), recall (100% for most categories), and F1-score (97.3%), while reducing the hallucination rate to 5.23%. The proposed framework improves accuracy, reduces manual effort, and supports scalable deployment in precision-driven industries.
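The headline figures are internally consistent: F1 is the harmonic mean of precision and recall, and plugging in the reported GD&T precision (94.77%) with 100% recall reproduces the reported 97.3%:

```python
# Consistency check: F1 as the harmonic mean of the reported
# GD&T precision (94.77%) and recall (100%).
precision, recall = 0.9477, 1.0
f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 1))  # → 97.3
```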