VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot anomaly detection and localization in the absence of labeled anomalous samples by introducing the first purely vision-based framework, one that entirely dispenses with text encoders and cross-modal alignment. Built on the Vision Transformer architecture, the method introduces learnable semantic tokens representing normal and anomalous concepts, coupled with a Spatial-Aware Cross-Attention (SCA) mechanism and a Self-Alignment Function (SAF) that together enable robust and efficient anomaly discrimination. The framework achieves state-of-the-art performance across 13 industrial and medical benchmark datasets and integrates seamlessly with various pretrained visual backbones, such as the CLIP image encoder and DINOv2, enhancing its generalizability and practical applicability.

📝 Abstract
Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision-language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image-text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Code: https://github.com/7HHHHH/VisualAD
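The token mechanism the abstract describes (two learnable tokens inside a frozen backbone that interact with patch tokens through self-attention, then drive per-patch anomaly scoring) can be sketched roughly as follows. This is an illustrative numpy sketch, not the paper's implementation: all names, shapes, and the single-layer attention are assumptions, and the SCA and SAF modules are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 64, 196  # embedding dim, number of patch tokens (assumed sizes)

# Hypothetical stand-ins: patch features from a frozen ViT backbone,
# plus two learnable semantic tokens for normality and abnormality.
patches = rng.standard_normal((N, D))
normal_tok = rng.standard_normal(D)
abnormal_tok = rng.standard_normal(D)

def attention(q, k, v):
    """Plain scaled dot-product self-attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Concatenate the two semantic tokens with the patch tokens and run one
# self-attention pass, so the tokens aggregate patch context and the
# patches are, in turn, influenced by the semantic tokens.
tokens = np.vstack([normal_tok, abnormal_tok, patches])
tokens = attention(tokens, tokens, tokens)
normal_ctx, abnormal_ctx, patch_ctx = tokens[0], tokens[1], tokens[2:]

def cos(a, b):
    """Cosine similarity of each row of a against vector b."""
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

# Per-patch anomaly map: softmax over similarity to the two tokens,
# keeping the "abnormal" probability as the anomaly score.
s_norm = cos(patch_ctx, normal_ctx)
s_abn = cos(patch_ctx, abnormal_ctx)
anomaly_map = np.exp(s_abn) / (np.exp(s_norm) + np.exp(s_abn))
```

In the paper, this interaction happens across multiple transformer layers rather than one, and the patch features are recalibrated by SAF before scoring; the sketch only shows the basic token-vs-patch similarity idea.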
Problem

Research questions and friction points this paper is trying to address.

zero-shot anomaly detection
anomaly localization
vision-language models
normality modeling
open-set discrimination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Anomaly Detection
Vision Transformer
Language-Free
Spatial-Aware Cross-Attention
Self-Alignment Function
Yanning Hou
College of Intelligence Science and Technology, National University of Defense Technology, Changsha, China
Peiyuan Li
School of Artificial Intelligence, Anhui University, Hefei, China
Zirui Liu
Peking University
Systems, Algorithms, Data Structures
Yitong Wang
ByteDance Inc.
computer vision
Yanran Ruan
School of Artificial Intelligence, Anhui University, Hefei, China
Jianfeng Qiu
School of Artificial Intelligence, Anhui University, Hefei, China
Ke Xu
Anhui University
Deep Learning, Network Quantization, Network Pruning, Neural Architecture Search, FPGA