🤖 AI Summary
Existing zero-shot anomaly detection (ZSAD) methods suffer from representation bottlenecks and overfitting on auxiliary data due to fixed or densely activated prompt strategies, leading to poor generalization to complex unseen anomalies. To address this, we propose Vision-Guided Mixture-of-Prompts (VGMoP), a novel framework that constructs a composable expert prompt pool and introduces a vision-gated sparse Mixture-of-Experts (MoE) architecture. VGMoP enables dynamic, sparse, and task-adaptive aggregation of normal and abnormal semantic prompts, effectively overcoming the limitations of single-prompt representations. This design significantly enhances both recognition and localization of unseen anomalous categories. Evaluated on 15 industrial and medical datasets, VGMoP achieves state-of-the-art performance, with substantial average improvements in detection AUC. The results demonstrate its superior generalization capability and practical applicability.
📝 Abstract
Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose $\mathtt{PromptMoE}$. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, $\mathtt{PromptMoE}$ learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance. Our framework materializes this concept through a Visually-Guided Mixture of Prompts (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of $\mathtt{PromptMoE}$.
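To make the image-gated sparse MoE idea concrete, here is a minimal NumPy sketch of the routing step the abstract describes: an image embedding scores a pool of expert prompt embeddings, only the top-k experts are activated, and their softmax-weighted combination yields an instance-adaptive prompt representation. All names, shapes, and the dot-product gate are illustrative assumptions, not the paper's actual implementation (where the gate and prompt pool would be learned jointly, with separate normal and abnormal pools).

```python
import numpy as np

def vision_gated_prompt_moe(image_feat, prompt_pool, k=2):
    """Hypothetical sketch of image-gated sparse prompt mixing.

    image_feat:  (d,) global image embedding driving the gate
    prompt_pool: (num_experts, d) expert prompt embeddings
    k:           number of experts activated per instance (sparsity)
    """
    scores = prompt_pool @ image_feat           # gating logits, one per expert
    topk = np.argsort(scores)[-k:]              # keep only the k best-matching experts
    masked = np.full_like(scores, -np.inf)      # non-selected experts get zero weight
    masked[topk] = scores[topk]
    weights = np.exp(masked - masked[topk].max())
    weights /= weights.sum()                    # softmax over the selected experts only
    mixed_prompt = weights @ prompt_pool        # sparse, instance-adaptive prompt embedding
    return mixed_prompt, weights
```

In the full framework this mixing would be applied separately to normal-state and abnormal-state prompt pools, and the resulting text embeddings compared against image features for detection and localization; the sketch only illustrates the sparse top-k routing itself.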