SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection

📅 2026-01-14
🤖 AI Summary
This work addresses the limitations of existing zero-shot anomaly detection methods, which rely on a single visual backbone and struggle to simultaneously achieve global semantic generalization and fine-grained structural discrimination. To overcome this, the authors propose a hierarchical semantic–visual collaboration mechanism coupled with a dual-gated calibration paradigm. This approach uniquely integrates DINOv3’s multi-scale structural priors into CLIP’s semantic space and employs dynamic language prompting to enable precise localization of anomalous regions. By synergistically combining vision–language models, cross-modal attention, and multi-source visual encodings, the method achieves state-of-the-art zero-shot performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming prior approaches, and demonstrates robustness across seven industrial benchmarks.

📝 Abstract
Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
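To make the abstract's pipeline concrete, the sketch below illustrates two of its ideas in minimal PyTorch: a vision-conditioned prompt generator in which learnable prompt tokens attend over image patch features via cross-modal attention (in the spirit of VCPG), and a patch-level anomaly map computed as a softmax over similarities to "normal"/"abnormal" text embeddings. This is a hypothetical illustration only: all dimensions, module names, and the 0.07 temperature are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionConditionedPromptGenerator(nn.Module):
    """Hypothetical VCPG-style module: learnable prompt tokens attend
    over patch features, so the text-side query adapts to each image.
    Structure and sizes are illustrative assumptions."""
    def __init__(self, dim: int = 512, n_prompts: int = 8, n_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N_patches, dim), e.g. fused CLIP+DINO features
        B = patch_feats.size(0)
        q = self.prompts.unsqueeze(0).expand(B, -1, -1)  # (B, n_prompts, dim)
        out, _ = self.attn(q, patch_feats, patch_feats)   # cross-attention
        return self.norm(q + out)                         # conditioned prompts

def anomaly_map(patch_feats, text_normal, text_abnormal, temperature=0.07):
    """Per-patch P(abnormal): cosine similarity of each patch feature to
    the two state embeddings, softmax over the normal/abnormal pair."""
    patch = F.normalize(patch_feats, dim=-1)              # (B, N, dim)
    states = F.normalize(torch.stack([text_normal, text_abnormal]), dim=-1)
    logits = patch @ states.t() / temperature             # (B, N, 2)
    return logits.softmax(dim=-1)[..., 1]                 # (B, N) in [0, 1]
```

A usage pass would feed fused patch features through the generator, encode the conditioned prompts with the text encoder, and threshold the resulting map for localization; the dual-gated calibration between the global score and this local map is not sketched here.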
Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Anomaly Detection
Vision-Language Models
Industrial Inspection
Semantic-Visual Fusion
Fine-grained Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergistic Semantic-Visual Prompting
Zero-Shot Anomaly Detection
Vision-Language Models
Hierarchical Semantic-Visual Synergy
Cross-modal Attention
👥 Authors
Chenhao Fu, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Han Fang, TeleAI, China Telecom (China Telecom Institute of Artificial Intelligence, TeleAI)
Xiuzheng Zheng, Institute of Artificial Intelligence (TeleAI), China Telecom, China
Wenbo Wei, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Yonghua Li, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Hao Sun, Central China Normal University
Xuelong Li, Institute of Artificial Intelligence (TeleAI), China Telecom, China