🤖 AI Summary
This study addresses the lack of systematic evaluation benchmarks for large multimodal models (LMMs) in structural defect detection, where existing approaches are often limited to passive perception and exhibit weak generalization. To bridge this gap, we introduce DefectBench, the first hierarchical, multidimensional benchmark tailored for structural pathology reasoning, spanning three cognitive levels: semantic perception, spatial localization, and generative geometric segmentation. The benchmark unifies twelve fragmented datasets into a single high-quality, open-source repository through a human-in-the-loop, semi-automatic annotation pipeline validated by domain experts, and is used to evaluate 18 state-of-the-art LMMs, including their capacity for zero-shot generative segmentation. Experiments demonstrate that off-the-shelf LMMs, without any domain-specific training, achieve semantic and topological understanding comparable to specialized supervised models, yet still lag significantly in metric-level localization accuracy.
📝 Abstract
Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception, generalize poorly, and lack the visual understanding needed to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop, semi-automated annotation framework that leverages expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometric Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing "what" and "how"), they exhibit significant deficiencies in metric localization precision ("where"). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.