VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion

📅 2025-11-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Addressing the challenge of detecting diverse, multi-class visual anomalies in real-world images, this paper proposes VLMDiff, an unsupervised multi-class anomaly detection framework that pioneers the integration of pretrained vision-language models (VLMs) with latent diffusion models (LDMs). Specifically, the VLM automatically generates semantic image descriptions that serve as conditional inputs to the LDM, enabling accurate modeling of complex normal patterns without manual annotations or additional training. Unlike conventional diffusion-based methods, the approach eliminates reliance on synthetic noise perturbations and restrictive single-class assumptions, thereby substantially improving generalizability and scalability. Evaluated on the Real-IAD and COCO-AD benchmarks, the method improves the pixel-level Per-Region Overlap (PRO) metric by up to 25.0 and 8.0 points, respectively, surpassing existing diffusion-based state-of-the-art approaches.
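To make the caption-extraction step concrete, here is a minimal sketch assuming a BLIP captioner from Hugging Face `transformers`. The summary does not name the exact VLM or prompt VLMDiff uses, so the checkpoint, prompt, and `describe` helper below are illustrative assumptions, not the paper's stated setup.

```python
# Hypothetical caption-extraction step: a pretrained VLM turns a normal
# training image into a text description that later conditions the LDM.
# The BLIP checkpoint is an assumption, not the paper's stated model.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def describe(image_path: str) -> str:
    """Return a free-form description of an image, with no manual labels."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    ids = captioner.generate(**inputs, max_new_tokens=40)
    return processor.decode(ids[0], skip_special_tokens=True)

caption = describe("normal_sample.png")  # e.g. "a metal part on a tray"
```

Because the captions come from a frozen, pretrained VLM, this step adds conditioning signal without any annotation effort or extra training, which is the scalability argument the summary makes.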

📝 Abstract
Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, which limits their generalization, and require per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, which learns a robust normal-image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.
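As a rough illustration of how such captions could condition LDM training, the following sketch runs one noise-prediction step using Stable Diffusion v1.5 components from `diffusers`. The actual backbone, schedule, and hyperparameters of VLMDiff are not given here, so everything below is an assumption about one plausible realization.

```python
# A hedged sketch of one text-conditioned LDM training step on normal images.
# Backbone (SD v1.5), epsilon-prediction loss, and shapes are assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
noise_scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

def training_step(images: torch.Tensor, captions: list[str]) -> torch.Tensor:
    """images: (B, 3, 512, 512) in [-1, 1]; captions: VLM descriptions."""
    # Encode images into the latent space the diffusion model operates in.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    # Embed the VLM captions and feed them as cross-attention conditioning.
    tokens = tokenizer(captions, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length,
                       return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)  # standard noise-prediction objective
```

Since only normal images (and their captions) enter this loop, one conditioned model can cover all classes, which is the multi-class claim of the abstract.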
Problem

Research questions and friction points this paper is trying to address.

Detecting visual anomalies in diverse multi-class real-world images
Overcoming limitations of synthetic noise generation in diffusion models
Eliminating per-class model training requirements for anomaly detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision-Language Model with Latent Diffusion Model
Uses VLM-generated captions as diffusion conditioning
Learns a robust multi-class anomaly representation without per-class training (see the inference sketch after this list)
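The summary does not spell out the scoring step, but a common recipe for diffusion-based localization is to partially noise the test image, denoise it under the caption condition, and score pixels by reconstruction error. The sketch below follows that generic recipe with a `diffusers` img2img pipeline; the `strength` value and the L1 error map are illustrative choices, not the paper's stated procedure.

```python
# Hedged inference sketch: caption-conditioned reconstruction, then a
# per-pixel error map as the anomaly score. Not the paper's exact method.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def anomaly_map(image: Image.Image, caption: str,
                strength: float = 0.3) -> np.ndarray:
    """High scores where the caption-conditioned reconstruction disagrees
    with the input, i.e. where the model cannot explain the pixels."""
    recon = pipe(prompt=caption, image=image, strength=strength).images[0]
    x = np.asarray(image.resize(recon.size), dtype=np.float32) / 255.0
    y = np.asarray(recon, dtype=np.float32) / 255.0
    return np.abs(x - y).mean(axis=-1)  # (H, W) mean absolute RGB error

img = Image.open("test_sample.png").convert("RGB")
scores = anomaly_map(img, caption="a metal part on a tray")  # caption from the VLM
image_score = float(scores.max())  # one possible image-level score
```

Pixel-level maps of this kind are what the PRO metric cited above evaluates, by measuring overlap with each ground-truth anomaly region.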
Samet Hicsonmez
University of Luxembourg, Luxembourg, Luxembourg
Abd El Rahman Shabayek
Research Scientist, SnT, University of Luxembourg
Computer Vision · Robotics · Omnidirectional vision · Polarization vision · Non-conventional omnidirectional sensors
D. Aouada
University of Luxembourg, Luxembourg, Luxembourg