GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection

📅 2025-04-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Zero-shot anomaly detection (ZSAD) suffers from unstable optimization of generic CLIP text prompts and poor cross-class generalization. To address these challenges, we propose GenCLIP, a multi-layer prompt-fusion and dual-branch inference framework. First, we design a generic prompt enhancement mechanism driven by multi-level CLIP visual features to improve prompt robustness. Second, we construct a dual-branch contrastive inference architecture, comprising a vision-enhanced branch and a query-only branch, to decouple semantic alignment from discriminative learning. Third, we introduce an adaptive text prompt filtering module that automatically removes class names that are atypical or unlikely to have appeared in CLIP's pre-training data, mitigating semantic drift. Evaluated on multiple ZSAD benchmarks, our method significantly improves anomaly localization accuracy and cross-class generalization, achieving notably more stable and interpretable zero-shot discrimination on unseen categories.
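As a rough illustration of the prompt-enhancement idea above, the following sketch pools patch features from several CLIP encoder layers and injects the fused visual cue into learnable general prompt tokens. The layer indices, global-average pooling, linear projection, and additive fusion are all assumptions for illustration, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_multilayer_prompts(layer_feats, prompt_tokens, proj):
    """Enrich learnable general prompt tokens with visual cues pooled
    from several CLIP encoder layers (illustrative sketch only).

    layer_feats:   list of (num_patches, d_vis) arrays, one per chosen layer
    prompt_tokens: (num_prompt_tokens, d_txt) learnable prompt embeddings
    proj:          (d_vis, d_txt) projection from visual to text space
    """
    # Global-average-pool each layer's patch features, then project to text space.
    pooled = [f.mean(axis=0) @ proj for f in layer_feats]   # each (d_txt,)
    visual_cue = np.mean(pooled, axis=0)                    # fuse across layers
    # Add the fused visual cue to every prompt token.
    return prompt_tokens + visual_cue

# Hypothetical shapes: 196 ViT patches, 768-d visual, 512-d text features.
layer_feats = [rng.standard_normal((196, 768)) for _ in (3, 6, 9, 12)]
prompt_tokens = rng.standard_normal((8, 512))
proj = rng.standard_normal((768, 512)) * 0.01
enriched = fuse_multilayer_prompts(layer_feats, prompt_tokens, proj)
print(enriched.shape)  # (8, 512)
```

In a real implementation the projection and prompt tokens would be trained jointly; here they are random placeholders to show the data flow.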

📝 Abstract
Zero-shot anomaly detection (ZSAD) aims to identify anomalies in unseen categories by leveraging CLIP's zero-shot capabilities to match text prompts with visual features. A key challenge in ZSAD is learning general prompts stably and utilizing them effectively, while maintaining both generalizability and category specificity. Although general prompts have been explored in prior works, achieving their stable optimization and effective deployment remains a significant challenge. In this work, we propose GenCLIP, a novel framework that learns and leverages general prompts more effectively through multi-layer prompting and dual-branch inference. Multi-layer prompting integrates category-specific visual cues from different CLIP layers, enriching general prompts with more comprehensive and robust feature representations. By combining general prompts with multi-layer visual features, our method further enhances its generalization capability. To balance specificity and generalization, we introduce a dual-branch inference strategy, where a vision-enhanced branch captures fine-grained category-specific features, while a query-only branch prioritizes generalization. The complementary outputs from both branches improve the stability and reliability of anomaly detection across unseen categories. Additionally, we propose an adaptive text prompt filtering mechanism, which removes irrelevant or atypical class names not encountered during CLIP's training, ensuring that only meaningful textual inputs contribute to the final vision-language alignment.
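The dual-branch inference described in the abstract can be sketched as follows: each branch produces a per-patch anomaly map by CLIP-style matching of patch features against "normal" vs. "abnormal" text embeddings, and the two maps are fused. The visual cue, temperature, and equal-weight averaging are assumptions for illustration; the paper's actual fusion rule may differ.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def anomaly_map(patch_feats, text_normal, text_abnormal, temperature=0.07):
    """Per-patch anomaly score: softmax over cosine similarity to
    'normal' vs. 'abnormal' text embeddings (CLIP-style matching)."""
    patches = l2norm(patch_feats)
    sims = np.stack([patches @ l2norm(text_normal),
                     patches @ l2norm(text_abnormal)], axis=-1) / temperature
    e = np.exp(sims - sims.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs[..., 1]  # probability of "abnormal" per patch

rng = np.random.default_rng(1)
patch_feats = rng.standard_normal((196, 512))
txt_n, txt_a = rng.standard_normal(512), rng.standard_normal(512)

# Vision-enhanced branch: prompts shifted by a visual cue (hypothetical).
visual_cue = rng.standard_normal(512) * 0.1
map_vision = anomaly_map(patch_feats, txt_n + visual_cue, txt_a + visual_cue)
# Query-only branch: the unmodified general prompts.
map_query = anomaly_map(patch_feats, txt_n, txt_a)
# Complementary fusion, here a simple average (assumed weighting).
final_map = 0.5 * (map_vision + map_query)
print(final_map.shape)  # (196,)
```

The vision-enhanced branch injects category-specific visual information into the text side, while the query-only branch keeps the general prompts intact; averaging the two maps mirrors the "complementary outputs" idea in the abstract.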
Problem

Research questions and friction points this paper is trying to address.

Improving zero-shot anomaly detection via stable general prompt learning
Enhancing CLIP's generalization with multi-layer visual feature integration
Balancing specificity and generalization through dual-branch inference strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-layer prompting integrates diverse CLIP features
Dual-branch inference balances specificity and generalization
Adaptive text filtering removes irrelevant class names
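The abstract only states that adaptive text filtering removes irrelevant or atypical class names not encountered during CLIP's training; it does not specify the criterion. One plausible stand-in, shown below purely as a sketch, is to drop candidate names whose text embedding is an outlier relative to the rest of the candidate set. Both the criterion and the example names are hypothetical.

```python
import numpy as np

def filter_class_names(names, embeddings, z_thresh=1.5):
    """Drop class names whose text embedding is an outlier within the
    candidate set -- a stand-in proxy for 'atypical names unlikely to
    have been seen during CLIP pre-training' (the paper's actual rule
    is not specified in this summary and may differ)."""
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    center = embs.mean(axis=0)
    center /= np.linalg.norm(center)
    sims = embs @ center                      # cosine similarity to centroid
    z = (sims - sims.mean()) / (sims.std() + 1e-8)
    return [n for n, zi in zip(names, z) if zi > -z_thresh]

# Toy example: four names share a direction; one points elsewhere.
embs = np.zeros((5, 8))
embs[:4, 0] = 1.0      # typical names
embs[4, 1] = 1.0       # atypical, made-up name
names = ["bottle", "cable", "capsule", "wood", "zzqx_widget"]
print(filter_class_names(names, embs))  # ['bottle', 'cable', 'capsule', 'wood']
```

In practice the embeddings would come from CLIP's text encoder; the z-score threshold here is an arbitrary illustrative choice.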