MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

📅 2024-10-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) lack standardized evaluation for industrial defect detection, hindering systematic assessment of their capabilities in real-world manufacturing scenarios. Method: We introduce MMAD—the first comprehensive, industrial-domain-specific benchmark for MLLMs—comprising 7 core subtasks, 8,366 industrial images, and 39,672 question-answer pairs. We formally define MLLM capability dimensions for industrial inspection, propose an automated multimodal data synthesis pipeline and structured task modeling framework, and design a zero-shot prompt optimization strategy. Contribution/Results: Through a cross-model quantitative evaluation framework, we find that the state-of-the-art model (GPT-4o) achieves only 74.9% average accuracy, revealing critical bottlenecks in fine-grained defect comprehension and contextual reasoning. MMAD establishes a rigorous foundation for benchmarking, methodological development, and diagnostic analysis of MLLMs in industrial applications.

📝 Abstract
In the field of industrial inspection, Multimodal Large Language Models (MLLMs) have high potential to renew the paradigms of practical applications due to their robust language capabilities and generalization abilities. However, despite their impressive problem-solving skills in many domains, MLLMs' ability in industrial anomaly detection has not been systematically studied. To bridge this gap, we present MMAD, the first-ever full-spectrum MLLM benchmark in industrial anomaly detection. We define seven key subtasks of MLLMs in industrial inspection and design a novel pipeline to generate the MMAD dataset, comprising 39,672 questions over 8,366 industrial images. With MMAD, we conduct a comprehensive, quantitative evaluation of various state-of-the-art MLLMs. The commercial models perform best, with GPT-4o reaching an average accuracy of 74.9%. However, this result falls far short of industrial requirements. Our analysis reveals that current MLLMs still have significant room for improvement in answering questions related to industrial anomalies and defects. We further explore two training-free performance enhancement strategies to help models improve in industrial scenarios, highlighting their promising potential for future research.
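The benchmark's headline metric is average accuracy over multiple-choice question-answer pairs grouped into subtasks. A minimal sketch of that evaluation loop is below; the record schema, field names, and the convention of averaging per-subtask accuracies (so each subtask weighs equally) are illustrative assumptions, not MMAD's actual data format.

```python
# Hedged sketch: per-subtask and average accuracy for a multiple-choice
# QA benchmark. The 'subtask'/'answer'/'prediction' field names are
# hypothetical, chosen for illustration only.
from collections import defaultdict

def evaluate(records):
    """records: iterable of dicts with 'subtask', 'answer', 'prediction'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subtask"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["subtask"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    # Assumed convention: average across subtasks rather than pooling
    # all questions, so no subtask dominates by question count.
    avg = sum(per_task.values()) / len(per_task)
    return per_task, avg

sample = [
    {"subtask": "defect_classification", "answer": "B", "prediction": "B"},
    {"subtask": "defect_classification", "answer": "A", "prediction": "C"},
    {"subtask": "anomaly_discrimination", "answer": "A", "prediction": "A"},
]
per_task, avg = evaluate(sample)  # per_task: {0.5, 1.0}; avg: 0.75
```

In this toy run, one subtask scores 0.5 and the other 1.0, giving an average of 0.75; a pooled accuracy over all three questions would instead be 2/3, which is why the aggregation choice matters when subtask sizes differ.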
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Factory Environment
Defect Detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

MMAD
Multimodal Large Language Models
Industrial Defect Detection
Xi Jiang
Southern University of Science and Technology
Computer Vision, Deep Learning
Jian Li
Tencent YouTu Lab
Hanqiu Deng
PhD student, University of Alberta
Computer Vision
Yong Liu
Tencent YouTu Lab
Bin-Bin Gao
Senior Researcher, Tencent YouTu
Computer Vision, Machine Learning, Artificial Intelligence
Yifeng Zhou
Tencent YouTu Lab
Jialin Li
Tencent YouTu Lab
Chengjie Wang
Tencent YouTu Lab, Shanghai Jiao Tong University
Feng Zheng
Southern University of Science and Technology