🤖 AI Summary
This paper addresses three key challenges in zero-shot anomaly detection: the absence of normal training samples, weak cross-domain generalization, and the performance gap between image-level and pixel-level localization. It proposes a context-aware CLIP-enhanced framework with three methodological contributions: (1) an image-context-guided dynamic text-prompt generation mechanism for semantic adaptability; (2) a restructured CLIP vision encoder that extracts high-fidelity dense features and adds an attention refinement module to explicitly model spatial structure; and (3) a dual-prompt collaborative modeling paradigm that jointly uses "normal" and "abnormal" text prompts. Evaluated on 14 benchmark datasets, the framework consistently outperforms state-of-the-art methods such as AnomalyCLIP and AdaCLIP, improving image-level classification and pixel-level segmentation accuracy by 2%–29%, while markedly strengthening zero-shot cross-domain generalization.
📝 Abstract
Anomaly Detection (AD) involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require normal training samples; however, collecting such data is often impractical. Additionally, these methods often struggle to generalize across domains. Recent advances, such as AnomalyCLIP and AdaCLIP, exploit the zero-shot generalization capabilities of CLIP but still exhibit a performance gap between image-level and pixel-level anomaly detection. To address this gap, we propose a novel approach that conditions the text encoder's prompts on image context extracted from the vision encoder. To capture fine-grained variations more effectively, we also modify the CLIP vision encoder and alter how its dense features are extracted, so that the features retain richer spatial and structural information for both normal and anomalous prompts. Our method achieves state-of-the-art results, improving by 2% to 29% across different metrics on 14 datasets, which demonstrates its effectiveness in both image-level and pixel-level anomaly detection.
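The dual-prompt idea underlying CLIP-based zero-shot AD can be made concrete with a small sketch. The snippet below is illustrative only, not the paper's implementation: it assumes dense patch features from a vision encoder and two text embeddings (one for a "normal" prompt, one for an "abnormal" prompt), all as plain vectors. Each patch's anomaly probability is a temperature-scaled softmax over its cosine similarities to the two prompts, and the image-level score is the maximum over patches; the function names (`anomaly_map`) and the temperature value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anomaly_map(patch_feats, normal_txt, abnormal_txt, tau=0.07):
    """Per-patch abnormal probability: softmax over (normal, abnormal)
    prompt similarities, scaled by temperature tau (CLIP-style)."""
    scores = []
    for p in patch_feats:
        s_n = cosine(p, normal_txt) / tau
        s_a = cosine(p, abnormal_txt) / tau
        m = max(s_n, s_a)  # subtract max for numerical stability
        e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
        scores.append(e_a / (e_n + e_a))
    return scores

# Toy 2-D embeddings standing in for real encoder outputs.
normal_txt = [1.0, 0.0]
abnormal_txt = [0.0, 1.0]
patches = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

pixel_scores = anomaly_map(patches, normal_txt, abnormal_txt)
image_score = max(pixel_scores)  # image-level score = most anomalous patch
```

In practice the pixel scores form a spatial map for segmentation, while the max (or a top-k mean) over patches gives the image-level classification score; the paper's contribution lies in how the prompts and dense features are produced, which this sketch deliberately abstracts away.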