🤖 AI Summary
This work addresses zero-shot anomaly detection and localization, where no labeled anomalous samples are available, by introducing the first purely vision-based framework that dispenses entirely with text encoders and cross-modal alignment. Built on the Vision Transformer architecture, the method introduces learnable semantic tokens representing normal and anomalous concepts, coupled with a Spatial-Aware Cross-Attention (SCA) module and a Self-Alignment Function (SAF) that together enable robust and efficient anomaly discrimination. The framework achieves state-of-the-art performance across 13 industrial and medical benchmark datasets and integrates seamlessly with various pretrained visual backbones, such as CLIP and DINOv2, which broadens its generalizability and practical applicability.
📝 Abstract
Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision-language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image-text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Code: https://github.com/7HHHHH/VisualAD
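The abstract describes scoring each patch by its affinity to a learnable normal token versus a learnable abnormal token. The paper's exact scoring rule is not given here, so the following is only a minimal sketch of one plausible reading: a softmax over cosine similarities between each patch feature and the two token embeddings, yielding a per-patch anomaly probability. All names (`anomaly_map`, `temperature`, the toy vectors) are hypothetical, not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anomaly_map(patch_feats, normal_tok, abnormal_tok, temperature=0.07):
    """Per-patch anomaly score in [0, 1]: softmax over the similarities
    to the (hypothetical) normal and abnormal token embeddings."""
    scores = []
    for p in patch_feats:
        s_n = cosine(p, normal_tok) / temperature
        s_a = cosine(p, abnormal_tok) / temperature
        m = max(s_n, s_a)  # subtract max for numerical stability
        e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
        scores.append(e_a / (e_n + e_a))
    return scores

# Toy illustration: one patch aligned with each token.
normal_tok = [1.0, 0.0]
abnormal_tok = [0.0, 1.0]
patches = [[0.9, 0.1], [0.1, 0.9]]
scores = anomaly_map(patches, normal_tok, abnormal_tok)
```

In the actual framework the tokens are refined through multi-layer self-attention with the patch tokens (plus SCA and SAF) before any such scoring; this sketch only illustrates the final token-versus-patch comparison step.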