AI Summary
Addressing the challenge of detecting diverse, multi-class visual anomalies in real-world images, this paper proposes an unsupervised multi-class anomaly detection framework that pioneers the integration of pretrained vision-language models (VLMs) with latent diffusion models (LDMs). Specifically, the VLM automatically generates semantic image descriptions to serve as conditional inputs for the LDM, enabling accurate modeling of complex normal patterns without manual annotations or additional training. Unlike conventional diffusion-based methods, our approach eliminates reliance on synthetic noise perturbations and restrictive single-class assumptions, thereby substantially improving generalizability and scalability. Evaluated on Real-IAD and COCO-AD benchmarks, the method achieves pixel-level Per-Region Overlap (PRO) improvements of +25.0 and +8.0 points, respectively, surpassing existing diffusion-based state-of-the-art approaches.
Abstract
Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. Current diffusion-based methods rely on synthetic noise generation, which limits their generalization, and require per-class model training, which hinders scalability. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework that integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for improved anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, which serve as additional conditioning for LDM training; these normal-image captions are obtained without manual annotations or additional training. Conditioned on these descriptions, the diffusion model learns a robust representation of normal image features for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.
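The pixel-level PRO metric reported above rewards covering every ground-truth anomaly region, not just the largest one: it averages, over the connected components of the ground-truth mask, the fraction of each component covered by the prediction. A minimal sketch of this metric (not the paper's code; assumes binary masks at a single fixed threshold and uses `scipy.ndimage.label` for connected components; full benchmarks typically average PRO over a sweep of thresholds):

```python
import numpy as np
from scipy.ndimage import label


def per_region_overlap(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Per-Region Overlap (PRO) at a fixed threshold.

    pred_mask, gt_mask: boolean arrays of the same shape.
    Returns the mean, over ground-truth anomaly regions (4-connected
    components), of |pred ∩ region| / |region|.
    """
    regions, n_regions = label(gt_mask)  # integer-labeled components
    if n_regions == 0:
        return 0.0  # no annotated anomalies in this image
    overlaps = []
    for r in range(1, n_regions + 1):
        region = regions == r
        overlaps.append((pred_mask & region).sum() / region.sum())
    return float(np.mean(overlaps))


# Toy example: two ground-truth regions (sizes 4 and 2); the prediction
# covers 3/4 of the first and 1/2 of the second, so PRO = 0.625.
gt = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 0, 1]], dtype=bool)
pred = np.array([[1, 0, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 1]], dtype=bool)
print(per_region_overlap(pred, gt))  # → 0.625
```

Because each region contributes equally regardless of its area, PRO penalizes methods that localize only large, easy defects, which is why it is a stricter localization measure than plain pixel-level AUROC.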