🤖 AI Summary
This paper addresses three key challenges in zero-shot anomaly detection: the absence of normal training samples, weak cross-domain generalization, and the performance gap between image-level and pixel-level localization. It proposes a context-aware CLIP-enhanced framework with three methodological contributions: (1) an image-context-guided dynamic text-prompt generation mechanism for semantic adaptability; (2) a restructured CLIP vision encoder that extracts high-fidelity dense features and adds an attention refinement module to explicitly model spatial structure; and (3) a dual-prompt collaborative modeling paradigm that jointly uses "normal" and "abnormal" text prompts. Evaluated on 14 benchmark datasets, the framework consistently outperforms state-of-the-art methods such as AnomalyCLIP and AdaCLIP, improving image-level classification and pixel-level segmentation accuracy by 2%–29%, while markedly strengthening zero-shot cross-domain generalization.
📝 Abstract
Anomaly Detection (AD) involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require normal training samples; however, collecting such data is often impractical. Additionally, these methods often struggle to generalize across domains. Recent advances, such as AnomalyCLIP and AdaCLIP, exploit the zero-shot generalization capabilities of CLIP but still exhibit a performance gap between image-level and pixel-level anomaly detection. To address this gap, we propose a novel approach that conditions the text encoder's prompts on image context extracted from the vision encoder. To capture fine-grained variations more effectively, we also modify the CLIP vision encoder and alter how its dense features are extracted, so that the features retain richer spatial and structural information for both normal and anomalous prompts. Our method achieves state-of-the-art results, improving by 2% to 29% across different metrics on 14 datasets, which demonstrates its effectiveness in both image-level and pixel-level anomaly detection.
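The dual-prompt idea underlying CLIP-based zero-shot AD can be made concrete with a small sketch. The snippet below is illustrative only, not the paper's implementation: it assumes dense patch features from a vision encoder and two text embeddings (one for a "normal" prompt, one for an "abnormal" prompt), all as plain vectors. Each patch's anomaly probability is a temperature-scaled softmax over its cosine similarities to the two prompts, and the image-level score is the maximum over patches; the function names (`anomaly_map`) and the temperature value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anomaly_map(patch_feats, normal_txt, abnormal_txt, tau=0.07):
    """Per-patch abnormal probability: softmax over (normal, abnormal)
    prompt similarities, scaled by temperature tau (CLIP-style)."""
    scores = []
    for p in patch_feats:
        s_n = cosine(p, normal_txt) / tau
        s_a = cosine(p, abnormal_txt) / tau
        m = max(s_n, s_a)  # subtract max for numerical stability
        e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
        scores.append(e_a / (e_n + e_a))
    return scores

# Toy 2-D embeddings standing in for real encoder outputs.
normal_txt = [1.0, 0.0]
abnormal_txt = [0.0, 1.0]
patches = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

pixel_scores = anomaly_map(patches, normal_txt, abnormal_txt)
image_score = max(pixel_scores)  # image-level score = most anomalous patch
```

In practice the pixel scores form a spatial map for segmentation, while the max (or a top-k mean) over patches gives the image-level classification score; the paper's contribution lies in how the prompts and dense features are produced, which this sketch deliberately abstracts away.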