VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion

📅 2025-11-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Addressing the challenge of detecting diverse, multi-class visual anomalies in real-world images, this paper proposes VLMDiff, an unsupervised multi-class anomaly detection framework that pioneers the integration of pretrained vision-language models (VLMs) with latent diffusion models (LDMs). Specifically, the VLM automatically generates semantic image descriptions that serve as conditional inputs to the LDM, enabling accurate modeling of complex normal patterns without manual annotations or additional training. Unlike conventional diffusion-based methods, the approach eliminates reliance on synthetic noise perturbations and restrictive single-class assumptions, thereby substantially improving generalizability and scalability. Evaluated on the Real-IAD and COCO-AD benchmarks, the method improves the pixel-level Per-Region Overlap (PRO) metric by up to 25.0 and 8.0 points, respectively, surpassing existing diffusion-based state-of-the-art approaches.
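To make the caption-extraction step concrete, here is a minimal sketch assuming a BLIP captioner from Hugging Face `transformers`. The summary does not name the exact VLM or prompt VLMDiff uses, so the checkpoint, prompt, and `describe` helper below are illustrative assumptions, not the paper's stated setup.

```python
# Hypothetical caption-extraction step: a pretrained VLM turns a normal
# training image into a text description that later conditions the LDM.
# The BLIP checkpoint is an assumption, not the paper's stated model.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def describe(image_path: str) -> str:
    """Return a free-form description of an image, with no manual labels."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    ids = captioner.generate(**inputs, max_new_tokens=40)
    return processor.decode(ids[0], skip_special_tokens=True)

caption = describe("normal_sample.png")  # e.g. "a metal part on a tray"
```

Because the captions come from a frozen, pretrained VLM, this step adds conditioning signal without any annotation effort or extra training, which is the scalability argument the summary makes.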

📝 Abstract
Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, which limits their generalization, and require per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, which learns a robust normal-image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.
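As a rough illustration of how such captions could condition LDM training, the following sketch runs one noise-prediction step using Stable Diffusion v1.5 components from `diffusers`. The actual backbone, schedule, and hyperparameters of VLMDiff are not given here, so everything below is an assumption about one plausible realization.

```python
# A hedged sketch of one text-conditioned LDM training step on normal images.
# Backbone (SD v1.5), epsilon-prediction loss, and shapes are assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
noise_scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

def training_step(images: torch.Tensor, captions: list[str]) -> torch.Tensor:
    """images: (B, 3, 512, 512) in [-1, 1]; captions: VLM descriptions."""
    # Encode images into the latent space the diffusion model operates in.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    # Embed the VLM captions and feed them as cross-attention conditioning.
    tokens = tokenizer(captions, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length,
                       return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)  # standard noise-prediction objective
```

Since only normal images (and their captions) enter this loop, one conditioned model can cover all classes, which is the multi-class claim of the abstract.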
Problem

Research questions and friction points this paper is trying to address.

Detecting visual anomalies in diverse multi-class real-world images
Overcoming limitations of synthetic noise generation in diffusion models
Eliminating per-class model training requirements for anomaly detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision-Language Model with Latent Diffusion Model
Uses VLM-generated captions as diffusion conditioning
Learns a robust multi-class anomaly representation without per-class training (see the inference sketch after this list)
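The summary does not spell out the scoring step, but a common recipe for diffusion-based localization is to partially noise the test image, denoise it under the caption condition, and score pixels by reconstruction error. The sketch below follows that generic recipe with a `diffusers` img2img pipeline; the `strength` value and the L1 error map are illustrative choices, not the paper's stated procedure.

```python
# Hedged inference sketch: caption-conditioned reconstruction, then a
# per-pixel error map as the anomaly score. Not the paper's exact method.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def anomaly_map(image: Image.Image, caption: str,
                strength: float = 0.3) -> np.ndarray:
    """High scores where the caption-conditioned reconstruction disagrees
    with the input, i.e. where the model cannot explain the pixels."""
    recon = pipe(prompt=caption, image=image, strength=strength).images[0]
    x = np.asarray(image.resize(recon.size), dtype=np.float32) / 255.0
    y = np.asarray(recon, dtype=np.float32) / 255.0
    return np.abs(x - y).mean(axis=-1)  # (H, W) mean absolute RGB error

img = Image.open("test_sample.png").convert("RGB")
scores = anomaly_map(img, caption="a metal part on a tray")  # caption from the VLM
image_score = float(scores.max())  # one possible image-level score
```

Pixel-level maps of this kind are what the PRO metric cited above evaluates, by measuring overlap with each ground-truth anomaly region.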
Samet Hicsonmez
University of Luxembourg, Luxembourg, Luxembourg
Abd El Rahman Shabayek
Research Scientist, SnT, University of Luxembourg
Computer Vision · Robotics · Omnidirectional vision · Polarization vision · Non-conventional omnidirectional sensors
D. Aouada
University of Luxembourg, Luxembourg, Luxembourg