🤖 AI Summary
To address the challenge of automated semantic parsing of densely annotated multi-view engineering drawings, this paper proposes a three-stage hybrid framework. First, YOLOv11-det performs layout segmentation; second, YOLOv11-obb enables orientation-aware, fine-grained annotation detection; third, two specialized vision-language models, an Alphabetical VLM and a Numerical VLM, collaboratively achieve OCR-free, end-to-end semantic understanding, handling textual and numerical information, respectively. By bypassing traditional OCR bottlenecks, the framework directly outputs structured JSON compatible with CAD and manufacturing systems. Evaluated on custom datasets, the Alphabetical and Numerical VLMs achieve F1 scores of 0.672 and 0.963, respectively, demonstrating strong parsing accuracy and cross-drawing generalization on complex engineering drawings. This work establishes a scalable, robust paradigm for engineering drawing understanding.
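For readers who want a sense of how such a three-stage chain could be wired up, here is a minimal sketch assuming the Ultralytics YOLO API and the Hugging Face Donut implementation. The weight paths, class routing, and task prompt are hypothetical placeholders, not the paper's released artifacts.

```python
# Hypothetical sketch of the three-stage pipeline described above.
# Assumes: ultralytics (YOLO detection / OBB heads) and Hugging Face
# transformers (Donut). Weight paths and the task prompt are placeholders.
import re
from PIL import Image
from ultralytics import YOLO
from transformers import DonutProcessor, VisionEncoderDecoderModel

layout_model = YOLO("yolov11_layout.pt")      # Stage 1 weights (hypothetical)
annot_model = YOLO("yolov11_obb_annot.pt")    # Stage 2 OBB weights (hypothetical)
# Stage 3: a Donut-style, OCR-free VLM (base checkpoint, not fine-tuned).
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
vlm = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

def parse_crop(crop: Image.Image, task_prompt: str) -> dict:
    """Run the Donut VLM on one cropped region and return parsed JSON."""
    pixel_values = processor(crop, return_tensors="pt").pixel_values
    decoder_ids = processor.tokenizer(
        task_prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    outputs = vlm.generate(
        pixel_values,
        decoder_input_ids=decoder_ids,
        max_length=512,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        return_dict_in_generate=True,
    )
    seq = processor.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "")
    seq = seq.replace(processor.tokenizer.pad_token, "")
    seq = re.sub(r"<.*?>", "", seq, count=1)  # drop the task-prompt token
    return processor.token2json(seq)          # structured dict, not raw text

drawing = Image.open("drawing.png").convert("RGB")
record = {"regions": []}
for box in layout_model(drawing)[0].boxes:          # Stage 1: layout regions
    x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
    crop = drawing.crop((x1, y1, x2, y2))
    obbs = annot_model(crop)[0].obb                 # Stage 2: oriented boxes
    parsed = parse_crop(crop, "<s_drawing>")        # Stage 3 (prompt is assumed)
    record["regions"].append({
        "bbox": [x1, y1, x2, y2],
        "n_annotations": 0 if obbs is None else len(obbs),
        "parsed": parsed,
    })
```

In the actual framework, the Stage 3 routing would presumably send title-block and note crops to the Alphabetical VLM and dimension/GD&T/roughness crops to the Numerical VLM; the sketch collapses this into a single model for brevity.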
📝 Abstract
Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging for manual methods, generic optical character recognition (OCR) systems, and traditional deep learning approaches due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision-language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GD&T frames, and surface roughness values. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
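To make the "unified JSON output" concrete, the fragment below shows what one parsed drawing record could plausibly look like. Every key name and value is an illustrative assumption; the paper's exact schema is not reproduced here.

```python
# Hypothetical example of a unified structured-output record.
# Field names and values are illustrative assumptions, not the paper's schema.
drawing_record = {
    "title_block": {                 # parsed by the Alphabetical VLM
        "part_name": "Flange Adapter",
        "material": "AISI 304",
        "scale": "1:2",
    },
    "notes": ["Break all sharp edges", "Deburr after machining"],
    "annotations": [                 # parsed by the Numerical VLM
        {"type": "measure", "value": 42.0, "unit": "mm",
         "tolerance": {"upper": 0.1, "lower": -0.1}},
        {"type": "gdt", "symbol": "flatness", "value": 0.05, "datum": None},
        {"type": "surface_roughness", "parameter": "Ra", "value": 1.6},
    ],
}
```

A record in this shape maps naturally onto CAD metadata fields and manufacturing-database tables, which is the integration path the abstract describes.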