IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection

📅 2025-10-15
🤖 AI Summary
To address weak multi-round human–machine interaction, insufficient fine-grained semantic description, and poor visual grounding capability of multimodal large models in industrial anomaly detection, this paper proposes a multimodal method integrating image-level and pixel-level visual information with enhanced textual semantic guidance. The method introduces three key innovations: (1) an anomaly prompt generator that leverages CLIP and large language models collaboratively to produce interpretable anomaly descriptions; (2) a text-guided enhancement module enabling cross-modal feature alignment; and (3) a multi-mask fusion mechanism improving pixel-level localization accuracy. Evaluated on MVTec-AD and VisA under both self-supervised and few-shot settings, the method achieves state-of-the-art performance in anomaly detection and segmentation. Moreover, it supports fine-grained anomaly characterization and interactive, multi-turn diagnostic reasoning.

📝 Abstract
The robust causal capability of Multimodal Large Language Models (MLLMs) holds the potential for detecting defective objects in Industrial Anomaly Detection (IAD). However, most traditional IAD methods cannot provide multi-turn human-machine dialogues or detailed descriptions, such as the color of objects, the shape of an anomaly, or the specific type of anomaly. At the same time, methods based on large pre-trained models have not fully exploited the abilities of large models in anomaly detection tasks. In this paper, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ an Abnormal Prompt Generator (APG) to generate detailed anomaly prompts for specific objects. These prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pre-trained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose a Text-Guided Enhancer, in which image features interact with normal and abnormal text prompts to dynamically select enhancement pathways. This enables language models to focus on specific aspects of visual data and improves their ability to accurately interpret and respond to anomalies within images. Moreover, we design a Multi-Mask Fusion module to incorporate masks as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on the MVTec-AD and VisA datasets demonstrate state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The code is available at https://github.com/LiZeWen1225/IAD-GPT.
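
The abstract says that normal and abnormal text prompts are used to activate detection and segmentation in CLIP. A common way to do this with a CLIP-style model is to compare each image patch embedding against one "normal" and one "abnormal" text embedding and take a softmax over the two cosine similarities. The sketch below illustrates that general idea only; it is not the authors' implementation. The 2-D toy vectors, the function names, and the temperature value of 0.07 are all assumptions standing in for real encoder outputs and settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def anomaly_probability(feature, normal_text, abnormal_text, temperature=0.07):
    """Softmax over the (normal, abnormal) similarities; returns P(abnormal).

    The temperature is an illustrative assumption, not a value from the paper.
    """
    s_normal = cosine(feature, normal_text) / temperature
    s_abnormal = cosine(feature, abnormal_text) / temperature
    m = max(s_normal, s_abnormal)  # subtract the max for numerical stability
    e_normal = math.exp(s_normal - m)
    e_abnormal = math.exp(s_abnormal - m)
    return e_abnormal / (e_normal + e_abnormal)

def anomaly_map(patch_features, normal_text, abnormal_text):
    """Score every patch in a 2-D grid of patch embeddings."""
    return [
        [anomaly_probability(p, normal_text, abnormal_text) for p in row]
        for row in patch_features
    ]

# Hypothetical toy embeddings (real ones would come from the text/image encoders).
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patches = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.2, 0.9], [0.95, 0.05]],  # the bottom-left patch resembles "abnormal"
]

amap = anomaly_map(patches, normal_text, abnormal_text)
# One simple image-level score: the most anomalous patch.
image_score = max(max(row) for row in amap)
```

Here the per-patch grid plays the role of a coarse segmentation map, and the maximum over patches gives an image-level detection score; other pooling choices are equally plausible.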
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal models for industrial anomaly detection tasks
Addressing lack of detailed anomaly descriptions in current methods
Improving pixel-level anomaly segmentation through visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

APG generates detailed anomaly prompts for objects
Text-Guided Enhancer dynamically selects visual enhancement pathways
Multi-Mask Fusion incorporates masks as expert knowledge
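
This page does not describe how Multi-Mask Fusion combines masks or how the result is passed to the LLM. As a purely hypothetical illustration of what fusing several pixel-level masks can mean, the sketch below averages same-sized anomaly maps and thresholds the result. The averaging rule, the 0.5 threshold, and the function names are assumptions, not the paper's method.

```python
def fuse_maps(maps, weights=None):
    """Weighted average of same-sized 2-D anomaly maps (uniform weights by default)."""
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, maps)) for c in range(cols)]
        for r in range(rows)
    ]

def binarize(fused, threshold=0.5):
    """Turn a fused score map into a 0/1 mask; the threshold is illustrative."""
    return [[1 if v >= threshold else 0 for v in row] for row in fused]

# Two hypothetical anomaly maps, e.g. from different feature levels.
maps = [
    [[0.1, 0.8], [0.2, 0.9]],
    [[0.3, 0.6], [0.1, 0.7]],
]

fused = fuse_maps(maps)
mask = binarize(fused)
```

In this toy case the right-hand column is flagged as anomalous by both maps, so it survives the fusion and thresholding.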

Authors

Zewen Li
School of Computer Science & Software Engineering, Shenzhen University, China, 518060, and School of Computing and Information Technology, Great Bay University, Dongguan, 523000, China

Zitong Yu
U.S. Food and Drug Administration
Medical imaging, Deep learning, Machine learning, Image reconstruction

Qilang Ye
College of Computer Science, Nankai University, Tianjin

Weicheng Xie
Associate Professor, Shenzhen University
Facial expression analysis, Deep learning, Image processing

Wei Zhuo
School of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China, Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen, China, and National Engineering Laboratory of Big Data System Computing Technology, Shenzhen University

Linlin Shen
Shenzhen University
Deep Learning, Computer Vision, Facial Analysis/Recognition, Medical Image Analysis