IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection

📅 2025-10-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address weak multi-round human–machine interaction, insufficient fine-grained semantic description, and poor visual grounding capability of multimodal large models in industrial anomaly detection, this paper proposes a multimodal method integrating image-level and pixel-level visual information with enhanced textual semantic guidance. The method introduces three key innovations: (1) an anomaly prompt generator that leverages CLIP and large language models collaboratively to produce interpretable anomaly descriptions; (2) a text-guided enhancement module enabling cross-modal feature alignment; and (3) a multi-mask fusion mechanism improving pixel-level localization accuracy. Evaluated on MVTec-AD and VisA under both self-supervised and few-shot settings, the method achieves state-of-the-art performance in anomaly detection and segmentation. Moreover, it supports fine-grained anomaly characterization and interactive, multi-turn diagnostic reasoning.

📝 Abstract

The robust causal capability of Multimodal Large Language Models (MLLMs) holds the potential for detecting defective objects in Industrial Anomaly Detection (IAD). However, most traditional IAD methods cannot support multi-turn human-machine dialogue or provide detailed descriptions, such as the color of objects, the shape of an anomaly, or the specific type of anomaly. At the same time, methods based on large pre-trained models have not fully exploited the ability of large models in anomaly detection tasks. In this paper, we explore the combination of rich text semantics with both image-level and pixel-level information from images and propose IAD-GPT, a novel paradigm based on MLLMs for IAD. We employ an Abnormal Prompt Generator (APG) to generate detailed anomaly prompts for specific objects. These specific prompts from the large language model (LLM) are used to activate the detection and segmentation functions of the pre-trained visual-language model (i.e., CLIP). To enhance the visual grounding ability of MLLMs, we propose a Text-Guided Enhancer, in which image features interact with normal and abnormal text prompts to dynamically select enhancement pathways. This enables language models to focus on specific aspects of visual data and improves their ability to accurately interpret and respond to anomalies within images. Moreover, we design a Multi-Mask Fusion module that incorporates masks as expert knowledge, which enhances the LLM's perception of pixel-level anomalies. Extensive experiments on the MVTec-AD and VisA datasets demonstrate state-of-the-art performance on self-supervised and few-shot anomaly detection and segmentation tasks. The code is available at https://github.com/LiZeWen1225/IAD-GPT.
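
The abstract says that normal and abnormal text prompts activate CLIP's detection and segmentation functions but does not give the scoring rule. The sketch below shows one common way CLIP-style models turn such prompts into an anomaly map: each patch feature is compared to a "normal" and an "abnormal" text embedding by cosine similarity, and a softmax over the two similarities gives a per-patch anomaly probability. This is an illustrative assumption, not the authors' implementation; the function names, the toy embeddings, and the temperature value 0.07 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_probability(feature, normal_text, abnormal_text, temperature=0.07):
    """Softmax over similarities to the normal and abnormal prompt embeddings.

    Returns the probability mass assigned to the abnormal prompt.
    The temperature is an assumed value, not taken from the paper.
    """
    s_normal = cosine(feature, normal_text) / temperature
    s_abnormal = cosine(feature, abnormal_text) / temperature
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(s_normal, s_abnormal)
    e_normal = math.exp(s_normal - m)
    e_abnormal = math.exp(s_abnormal - m)
    return e_abnormal / (e_normal + e_abnormal)


def anomaly_map(patch_features, normal_text, abnormal_text):
    """Apply the per-patch score over a 2-D grid of patch feature vectors."""
    return [
        [anomaly_probability(p, normal_text, abnormal_text) for p in row]
        for row in patch_features
    ]


# Toy 2-D embeddings standing in for real CLIP text features (hypothetical).
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patches = [[[0.9, 0.1], [0.1, 0.9]]]  # a 1 x 2 grid of patch features
scores = anomaly_map(patches, normal_text, abnormal_text)
```

In this toy setup the first patch lies close to the normal prompt and scores below 0.5, while the second lies close to the abnormal prompt and scores above 0.5. An image-level score could then be taken as, for example, the maximum over the map.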

Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal models for industrial anomaly detection tasks

Addressing lack of detailed anomaly descriptions in current methods

Improving pixel-level anomaly segmentation through visual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

APG generates detailed anomaly prompts for objects

Text-Guided Enhancer dynamically selects visual enhancement pathways

Multi-Mask Fusion incorporates masks as expert knowledge
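
The summary does not state how Multi-Mask Fusion combines its masks. As a rough illustration only, the sketch below assumes the simplest possible rule: a weighted average of several same-sized anomaly maps followed by thresholding. The function names, the weighting scheme, and the 0.5 threshold are hypothetical and may differ from what the paper does.

```python
def fuse_masks(masks, weights=None):
    """Weighted average of same-sized 2-D anomaly maps (assumed fusion rule)."""
    if not masks:
        raise ValueError("at least one mask is required")
    if weights is None:
        weights = [1.0] * len(masks)
    if len(weights) != len(masks):
        raise ValueError("one weight per mask is required")
    total = sum(weights)
    height, width = len(masks[0]), len(masks[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for mask, weight in zip(masks, weights):
        for i in range(height):
            for j in range(width):
                fused[i][j] += weight * mask[i][j] / total
    return fused


def binarize(mask, threshold=0.5):
    """Turn a soft anomaly map into a binary segmentation mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in mask]


# Two toy 1 x 2 anomaly maps, e.g. from different feature levels (hypothetical).
mask_a = [[0.0, 1.0]]
mask_b = [[0.0, 0.5]]
fused = fuse_masks([mask_a, mask_b])
segmentation = binarize(fused)
```

With equal weights, the second pixel averages to 0.75 and is marked anomalous, while the first stays at 0.0. A learned fusion, which the paper may well use, would replace the fixed weights with trainable parameters.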

🔎 Similar Papers

Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning