🤖 AI Summary
This work proposes a novel framework for zero-shot anomaly detection that overcomes the limitations of fixed text prompts and spatial-domain-only features, which fail to capture complex semantics and subtle anomalies. The approach integrates multi-frequency visual analysis with semantic adaptability: a variational autoencoder models global semantics and dynamically refines CLIP text embeddings, while wavelet decomposition extracts multi-scale frequency-domain features. A semantic-aware mixture-of-experts module is further introduced to enable fine-grained cross-modal alignment. Notably, this is the first method to combine wavelet-based multi-frequency analysis with mixture-of-experts prompt learning. Extensive experiments on 14 industrial and medical datasets demonstrate significant performance gains over existing approaches, highlighting its superior generalization capability.
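The wavelet step can be pictured with a minimal single-level 2D Haar decomposition. This is a generic illustration of multi-frequency feature extraction, not the paper's implementation; the function name `haar_dwt2` and the toy input are made up for the sketch.

```python
import numpy as np

def haar_dwt2(image):
    """Single-level 2D Haar wavelet decomposition.

    Splits an (H, W) image with even sides into four half-resolution
    sub-bands: LL (low-frequency approximation) plus LH, HL, HH
    (horizontal, vertical, diagonal detail). The detail bands are the
    kind of high-frequency signal where subtle anomalies tend to show up.
    """
    # 1D Haar transform along rows: pairwise averages and differences.
    lo = (image[:, 0::2] + image[:, 1::2]) / 2.0
    hi = (image[:, 0::2] - image[:, 1::2]) / 2.0
    # Repeat along columns of each intermediate result.
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return ll, lh, hl, hh

# A perfectly flat image has all its energy in LL; detail bands vanish.
flat = np.ones((4, 4))
ll, lh, hl, hh = haar_dwt2(flat)
```

Applying `haar_dwt2` recursively to the LL band yields the multi-scale pyramid that "multi-frequency" analysis refers to; practical systems typically use a wavelet library (e.g. PyWavelets) rather than hand-rolled transforms.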
📝 Abstract
Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
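The mixture-of-experts idea behind the aggregation module can be sketched in its simplest soft-gated form. This is a generic MoE layer under assumed shapes, not the paper's semantic-aware module; all class and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MixtureOfExperts:
    """Soft mixture-of-experts: a learned gate scores every expert per
    input token, and the output is the gate-weighted sum of the experts'
    linear projections."""

    def __init__(self, dim, n_experts):
        # Toy random parameters stand in for trained weights.
        self.gate = rng.standard_normal((dim, n_experts)) * 0.02
        self.experts = rng.standard_normal((n_experts, dim, dim)) * 0.02

    def __call__(self, x):                       # x: (tokens, dim)
        weights = softmax(x @ self.gate)         # (tokens, n_experts)
        outs = np.einsum('td,edh->teh', x, self.experts)  # each expert's output
        return np.einsum('te,teh->th', weights, outs)     # gate-weighted mix

moe = MixtureOfExperts(dim=8, n_experts=4)
tokens = rng.standard_normal((5, 8))
y = moe(tokens)  # shape (5, 8), same as the input
```

A "semantic-aware" variant would condition the gate on semantic context (here it sees only the token itself), but the gating-and-mixing mechanics are the same.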