🤖 AI Summary
This work addresses a limitation of existing zero-shot anomaly detection methods: relying on a single visual backbone, they struggle to achieve both global semantic generalization and fine-grained structural discrimination. To overcome this, the authors propose a hierarchical semantic–visual collaboration mechanism coupled with a dual-gated calibration paradigm. The approach integrates DINOv3's multi-scale structural priors into CLIP's semantic space and employs dynamic language prompting to localize anomalous regions precisely. By combining vision–language models, cross-modal attention, and multi-source visual encodings, the method achieves state-of-the-art zero-shot performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming prior approaches, and demonstrates robustness across seven industrial benchmarks.
📝 Abstract
Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to enhance the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to anchor precisely to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
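The pipeline the abstract describes — text prompts conditioned on structural patch features via cross-modal attention, then patch-level scoring against normal/anomalous prompts — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the encoder outputs are random stand-ins, the shared dimension `D`, patch count `P`, residual conditioning, and the CLIP-style temperature of 0.07 are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64   # shared embedding dim (assumption; real CLIP/DINOv3 dims differ)
P = 196  # number of image patches, e.g. a 14x14 grid (assumption)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries attend over keys/values."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

# Stand-ins for encoder outputs (hypothetical; the paper uses DINOv3 + CLIP).
dino_patches = rng.standard_normal((P, D))  # multi-scale structural patch tokens
clip_prompts = rng.standard_normal((2, D))  # "normal" / "anomalous" text embeddings

# Vision-conditioned prompting: text queries attend to structural patch tokens,
# added residually so prompts adapt to the specific image's anomaly patterns.
conditioned = clip_prompts + cross_attention(clip_prompts, dino_patches, dino_patches)

# Pixel-level scoring: cosine similarity of each patch to both prompts,
# softmaxed per patch so normal/anomalous probabilities sum to one.
norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
sims = norm(dino_patches) @ norm(conditioned).T    # shape (P, 2)
anomaly_map = softmax(sims / 0.07, axis=-1)[:, 1]  # temperature 0.07, as in CLIP

# Global image score taken as the strongest local evidence -- the kind of
# global/local discrepancy the paper's dual-gated calibration addresses.
image_score = anomaly_map.max()
```

The per-patch `anomaly_map` plays the role of the pixel-level prediction, while `image_score` is the image-level decision; the paper's VTAM module calibrates the two against each other rather than using a raw max as shown here.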