🤖 AI Summary
This work investigates the feasibility and robustness of unified vision-language models (VLMs) as replacements for task-specific models in industrial inspection. To address diverse requirements—including image classification, object detection, and keypoint localization—it proposes a language-interface-based unified paradigm, introduces InspectMM—the first large-scale multimodal industrial inspection dataset—and performs instruction tuning on Florence-2. Experiments show strong performance on classification and structured keypoint tasks, but limited robustness on fine-grained detection, high sensitivity to prompt engineering, and weaker visual grounding compared to specialized architectures like ResNet. The core contribution lies in empirically exposing the fundamental tension between *linguistic unification* and *visual reliability* in VLMs for industrial applications, thereby establishing an empirical benchmark and identifying concrete directions for advancing VLMs toward high-precision industrial deployment.
📝 Abstract
Unified vision-language models (VLMs) promise to streamline computer vision pipelines by reframing multiple visual tasks, such as classification, detection, and keypoint localization, within a single language-driven interface. This paradigm is particularly appealing in industrial inspection, where managing disjoint task-specific models introduces complexity, inefficiency, and maintenance overhead. In this paper, we critically evaluate the viability of this unified paradigm using InspectVLM, a Florence-2-based VLM trained on InspectMM, our new large-scale multimodal, multitask inspection dataset. While InspectVLM performs competitively on image-level classification and structured keypoint tasks, we find that it fails to match traditional ResNet-based models on core inspection metrics. Notably, the model exhibits brittle behavior under low prompt variability, produces degenerate outputs for fine-grained object detection, and frequently defaults to memorized language responses regardless of visual input. Our findings suggest that while language-driven unification offers conceptual elegance, current VLMs lack the visual grounding and robustness necessary for deployment in precision-critical industrial inspection.