🤖 AI Summary
This work addresses the structural blind spots of multimodal large language models (MLLMs) in interpreting engineering drawings, particularly their inability to capture topological structures and symbolic logic. To overcome this limitation, the authors propose Vector-to-Graph (V2G), a novel approach that explicitly models CAD vector drawings as attributed graphs, where nodes and edges precisely represent components and their interconnections. By moving beyond conventional pixel-driven paradigms, V2G endows MLLMs with engineering-level structural reasoning capabilities. The method is evaluated on an electrical compliance diagnosis benchmark, demonstrating significant performance gains across all error categories compared to existing MLLMs, whose accuracy remains near random levels.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual understanding, yet they suffer from a critical limitation: structural blindness. Even state-of-the-art models fail to capture topology and symbolic logic in engineering schematics, as their pixel-driven paradigm discards the explicit vector-defined relations needed for reasoning. To overcome this, we propose a Vector-to-Graph (V2G) pipeline that converts CAD diagrams into property graphs where nodes represent components and edges encode connectivity, making structural dependencies explicit and machine-auditable. On a diagnostic benchmark of electrical compliance checks, V2G yields large accuracy gains across all error categories, while leading MLLMs remain near chance level. These results highlight the systemic inadequacy of pixel-based methods and demonstrate that structure-aware representations provide a reliable path toward practical deployment of multimodal AI in engineering domains. To facilitate further research, we release our benchmark and implementation at https://github.com/gm-embodied/V2G-Audit.
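The core idea — converting vector-defined CAD entities into a property graph whose nodes are components and whose edges encode connectivity — can be sketched in a few lines. The snippet below is a minimal illustrative toy, not the paper's actual V2G implementation: the `Component`/`Wire` entity names, the terminal-coincidence matching rule, and all attribute names are assumptions made for this example.

```python
from dataclasses import dataclass

# Illustrative V2G-style conversion (hypothetical schema, not the paper's):
# vector entities (component symbols, wire segments) become an attributed
# graph where nodes are components and edges encode connectivity.

@dataclass(frozen=True)
class Component:
    cid: str        # component identifier, e.g. "breaker_1"
    ctype: str      # symbol class, e.g. "circuit_breaker"
    terminal: tuple # (x, y) coordinate of its connection terminal

@dataclass(frozen=True)
class Wire:
    start: tuple    # (x, y) endpoint
    end: tuple      # (x, y) endpoint

def vectors_to_graph(components, wires, tol=1e-6):
    """Build node attributes and an undirected edge set from vector entities."""
    nodes = {c.cid: {"type": c.ctype} for c in components}
    edges = set()
    for w in wires:
        # A wire connects two components when its endpoints coincide
        # (within tolerance) with the components' terminals.
        hits = []
        for pt in (w.start, w.end):
            for c in components:
                if abs(c.terminal[0] - pt[0]) <= tol and abs(c.terminal[1] - pt[1]) <= tol:
                    hits.append(c.cid)
        if len(hits) == 2 and hits[0] != hits[1]:
            edges.add(tuple(sorted(hits)))
    return nodes, edges

# Toy schematic: a breaker wired to a motor.
comps = [
    Component("breaker_1", "circuit_breaker", (0.0, 0.0)),
    Component("motor_1", "motor", (10.0, 0.0)),
]
wires = [Wire((0.0, 0.0), (10.0, 0.0))]
nodes, edges = vectors_to_graph(comps, wires)
print(edges)  # -> {('breaker_1', 'motor_1')}
```

Once connectivity is explicit in this form, compliance rules (e.g. "every motor must be fed through a breaker") reduce to auditable graph queries rather than pixel-level perception.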