🤖 AI Summary
Multimodal large language models (MLLMs) show insufficient vision-language understanding in industrial skill domains, with welding as a representative example.

Method: We introduce WeldBench, the first welding-specific multimodal evaluation benchmark, together with a fine-grained framework for assessing industrial skill understanding. The benchmark integrates welding imagery, process-related textual descriptions, and expert annotations; the framework combines zero-shot/few-shot prompting, attention visualization, and cross-modal feature attribution analysis.

Results: Experiments show that state-of-the-art MLLMs underperform human experts by a wide margin, achieving below 42% accuracy on welding defect classification and failing at procedural reasoning, which highlights a critical gap in domain-specific semantic alignment. This work is the first to systematically expose MLLMs' limitations in high-precision, skill-intensive tasks, empirically validating the need for domain adaptation and expert knowledge injection. It establishes a reproducible evaluation paradigm and concrete improvement pathways for industrial-grade MLLM development.
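The zero-shot evaluation protocol mentioned above can be sketched as a simple scoring loop. This is a minimal illustration, not the paper's actual harness: the defect class names, the prompt wording, and the `classify` callable (standing in for a real MLLM API call that takes an image and a prompt) are all hypothetical placeholders.

```python
from typing import Callable, Iterable, Tuple, List

# Illustrative defect labels; the real WeldBench taxonomy may differ.
DEFECT_CLASSES = ["porosity", "crack", "undercut", "spatter", "no_defect"]

def zero_shot_prompt(classes: List[str]) -> str:
    # Build a zero-shot prompt listing the candidate defect classes.
    return ("You are a welding inspector. Classify the weld image into one of: "
            + ", ".join(classes) + ". Answer with the class name only.")

def evaluate(classify: Callable[[str, str], str],
             samples: Iterable[Tuple[str, str]]) -> float:
    # samples: (image_path, ground_truth_label) pairs; returns accuracy.
    prompt = zero_shot_prompt(DEFECT_CLASSES)
    correct = total = 0
    for image_path, label in samples:
        pred = classify(image_path, prompt).strip().lower()
        correct += int(pred == label)
        total += 1
    return correct / total if total else 0.0

# Stub "model" that always answers "porosity"; swap in a real MLLM call.
stub = lambda image_path, prompt: "porosity"
acc = evaluate(stub, [("weld_01.png", "porosity"), ("weld_02.png", "crack")])
print(acc)  # 0.5 with this stub
```

The same loop extends to few-shot prompting by prepending labeled exemplars to the prompt string before the query image.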