🤖 AI Summary
Zero-shot anomaly detection (ZSAD) faces two key challenges: static learnable prompts struggle to capture the continuous diversity of normal and anomalous states, while fixed textual labels suffer from semantic sparsity and leave models prone to overfitting. To address these, we propose CoPS, a conditional prompt synthesis framework that dynamically generates prompts conditioned on visual features, injecting prototypes extracted from fine-grained patch features alongside variational autoencoder-based implicit class encodings to enable state-adaptive perception and semantic enrichment. A spatially-aware alignment mechanism further mitigates prompt rigidity and label sparsity. Evaluated on 13 industrial and medical datasets, CoPS improves AUROC by an average of 2.5% on both classification and segmentation, demonstrating markedly stronger cross-class generalization in zero-shot anomaly detection.
📝 Abstract
Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). After fine-tuning on a single auxiliary dataset, such models can detect anomalies across diverse unseen categories, covering both industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-and-error. However, it still faces the following challenges: (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. To address these issues, we propose Conditional Prompt Synthesis (CoPS), a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance. Specifically, we extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into the prompts, enabling adaptive state modeling. To counter the sparsity of class labels, we leverage a variational autoencoder to model semantic image features and implicitly fuse varied class tokens into the prompts. Combined with our spatially-aware alignment mechanism, CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets, as demonstrated by extensive experiments. Code will be available at https://github.com/cqylunlun/CoPS.
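To make the pipeline in the abstract concrete, here is a minimal NumPy sketch of the prompt-synthesis idea: pool normal/anomaly prototypes from patch features, sample an implicit class token via VAE-style reparameterization, and fuse both with learnable tokens into a dynamic prompt. All dimensions, weights, and function names are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper):
# D = embedding dim, P = number of patch tokens, T = learnable prompt length.
D, P, T = 64, 196, 8

def extract_prototypes(patch_feats, normal_anchor, k=10):
    """Average the k patches most / least similar to a normal anchor
    into a normal / anomaly prototype (illustrative stand-in for the
    paper's fine-grained prototype extraction)."""
    sims = patch_feats @ normal_anchor           # (P,) similarity scores
    order = np.argsort(sims)                     # ascending
    normal_proto = patch_feats[order[-k:]].mean(axis=0)
    anomaly_proto = patch_feats[order[:k]].mean(axis=0)
    return normal_proto, anomaly_proto

def vae_class_token(image_feat, W_mu, W_logvar):
    """Reparameterized sample of an implicit class token from the
    image-level feature (VAE-style encoding)."""
    mu = W_mu @ image_feat
    logvar = W_logvar @ image_feat
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def synthesize_prompt(learnable_tokens, state_proto, class_token):
    """Concatenate static learnable tokens with the state prototype
    and the sampled class token into one dynamic prompt."""
    return np.vstack([learnable_tokens, state_proto[None], class_token[None]])

# Toy inputs standing in for CLIP-style features.
patch_feats = rng.standard_normal((P, D))
image_feat = patch_feats.mean(axis=0)
normal_anchor = rng.standard_normal(D)
W_mu, W_logvar = rng.standard_normal((2, D, D)) * 0.01
learnable = rng.standard_normal((T, D))

n_proto, a_proto = extract_prototypes(patch_feats, normal_anchor)
cls_tok = vae_class_token(image_feat, W_mu, W_logvar)

normal_prompt = synthesize_prompt(learnable, n_proto, cls_tok)
anomaly_prompt = synthesize_prompt(learnable, a_proto, cls_tok)
print(normal_prompt.shape)  # (T + 2, D)
```

In this toy version the two prompts differ only in the injected state prototype, mirroring how a single set of learnable tokens can yield state-adaptive normal and anomaly text prompts.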