MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

📅 2024-10-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) lack standardized evaluation for industrial defect detection, hindering systematic assessment of their capabilities in real-world manufacturing scenarios. Method: We introduce MMAD—the first comprehensive, industrial-domain-specific benchmark for MLLMs—comprising 7 core subtasks, 8,366 industrial images, and 39,672 question-answer pairs. We formally define MLLM capability dimensions for industrial inspection, propose an automated multimodal data synthesis pipeline and structured task modeling framework, and design a zero-shot prompt optimization strategy. Contribution/Results: Through a cross-model quantitative evaluation framework, we find that the state-of-the-art model (GPT-4o) achieves only 74.9% average accuracy, revealing critical bottlenecks in fine-grained defect comprehension and contextual reasoning. MMAD establishes a rigorous foundation for benchmarking, methodological development, and diagnostic analysis of MLLMs in industrial applications.

📝 Abstract
In the field of industrial inspection, Multimodal Large Language Models (MLLMs) have high potential to renew the paradigms of practical applications due to their robust language capabilities and generalization abilities. However, despite their impressive problem-solving skills in many domains, MLLMs' ability in industrial anomaly detection has not been systematically studied. To bridge this gap, we present MMAD, the first-ever full-spectrum MLLM benchmark in industrial anomaly detection. We define seven key subtasks of MLLMs in industrial inspection and design a novel pipeline to generate the MMAD dataset, comprising 39,672 questions over 8,366 industrial images. With MMAD, we conduct a comprehensive, quantitative evaluation of various state-of-the-art MLLMs. The commercial models perform best, with GPT-4o reaching an average accuracy of 74.9%. However, this result falls far short of industrial requirements. Our analysis reveals that current MLLMs still have significant room for improvement in answering questions related to industrial anomalies and defects. We further explore two training-free performance enhancement strategies to help models improve in industrial scenarios, highlighting their promising potential for future research.
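The benchmark's headline metric is average accuracy over multiple-choice question-answer pairs grouped into subtasks. A minimal sketch of that evaluation loop is below; the record schema, field names, and the convention of averaging per-subtask accuracies (so each subtask weighs equally) are illustrative assumptions, not MMAD's actual data format.

```python
# Hedged sketch: per-subtask and average accuracy for a multiple-choice
# QA benchmark. The 'subtask'/'answer'/'prediction' field names are
# hypothetical, chosen for illustration only.
from collections import defaultdict

def evaluate(records):
    """records: iterable of dicts with 'subtask', 'answer', 'prediction'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subtask"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["subtask"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    # Assumed convention: average across subtasks rather than pooling
    # all questions, so no subtask dominates by question count.
    avg = sum(per_task.values()) / len(per_task)
    return per_task, avg

sample = [
    {"subtask": "defect_classification", "answer": "B", "prediction": "B"},
    {"subtask": "defect_classification", "answer": "A", "prediction": "C"},
    {"subtask": "anomaly_discrimination", "answer": "A", "prediction": "A"},
]
per_task, avg = evaluate(sample)  # per_task: {0.5, 1.0}; avg: 0.75
```

In this toy run, one subtask scores 0.5 and the other 1.0, giving an average of 0.75; a pooled accuracy over all three questions would instead be 2/3, which is why the aggregation choice matters when subtask sizes differ.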
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Factory Environment
Defect Detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

MMAD
Multimodal Large Language Models
Industrial Defect Detection
Xi Jiang
Southern University of Science and Technology
Computer Vision, Deep Learning
Jian Li
Tencent YouTu Lab
Hanqiu Deng
PhD student, University of Alberta
Computer Vision
Yong Liu
Tencent YouTu Lab
Bin-Bin Gao
Senior Researcher, Tencent YouTu
Computer Vision, Machine Learning, Artificial Intelligence
Yifeng Zhou
Tencent YouTu Lab
Jialin Li
Tencent YouTu Lab
Chengjie Wang
Tencent YouTu Lab, Shanghai Jiao Tong University
Feng Zheng
Southern University of Science and Technology