Do multimodal large language models understand welding?

📅 2025-03-01
🏛️ Information Fusion
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) show weak vision-language understanding in industrial skill domains such as welding. Method: the authors introduce WeldBench, a welding-specific multimodal evaluation benchmark, together with a fine-grained framework for assessing industrial skill understanding. The benchmark pairs welding imagery with process-related text and expert annotations, and the evaluation combines zero-shot/few-shot prompting, attention visualization, and cross-modal feature attribution analysis. Results: state-of-the-art MLLMs fall well short of human experts, achieving under 42% accuracy on welding defect classification and failing at procedural reasoning, which points to a critical gap in domain-specific semantic alignment. The work presents itself as the first systematic study of MLLM limitations in high-precision, skill-intensive tasks, arguing for domain adaptation and expert knowledge injection, and it offers a reproducible evaluation paradigm and concrete improvement pathways for industrial-grade MLLM development.

Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' performance in assessing weld acceptability across industries
Introducing WeldPrompt to reduce hallucinations and enhance reasoning
Exploring the limitations and potential of MLLMs in technical welding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates MLLMs on real-world weld images
Introduces WeldPrompt for improved reasoning
Combines Chain-of-Thought with in-context learning
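The combination of Chain-of-Thought with in-context learning noted above can be sketched as a prompt builder whose few-shot exemplars each model an explicit reasoning chain before the verdict. The function names, message format, and "Reasoning:/Verdict:" template are assumptions for illustration, not the paper's WeldPrompt implementation.

```python
QUESTION = "Is this weld acceptable? Think step by step."


def cot_example(image_ref: str, reasoning: str, verdict: str) -> list[dict]:
    """One in-context exemplar whose answer shows its reasoning chain."""
    return [
        {"role": "user", "content": f"Image: {image_ref}\n{QUESTION}"},
        {"role": "assistant",
         "content": f"Reasoning: {reasoning}\nVerdict: {verdict}"},
    ]


def build_cot_prompt(target_image: str,
                     examples: list[tuple[str, str, str]]) -> list[dict]:
    """Few-shot CoT prompt: each exemplar demonstrates the desired
    reasoning format, then the target image is asked in the same form."""
    messages = [{"role": "system",
                 "content": "You are a certified welding inspector."}]
    for ref, reasoning, verdict in examples:
        messages += cot_example(ref, reasoning, verdict)
    messages.append({"role": "user",
                     "content": f"Image: {target_image}\n{QUESTION}"})
    return messages
```

The design point is that the model imitates the exemplars' structure, so eliciting an explicit reasoning chain before the verdict is what the few-shot format enforces; this is the general CoT-plus-ICL pattern, under the hedged assumptions stated above.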