PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixtures

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot anomaly detection (ZSAD) methods suffer from representation bottlenecks and overfitting on auxiliary data due to fixed or densely activated prompt strategies, leading to poor generalization to complex unseen anomalies. To address this, we propose PromptMoE, a novel framework that constructs a composable pool of expert prompts and introduces a Visually-Guided Mixture of Prompts (VGMoP) built on a vision-gated sparse Mixture-of-Experts (MoE) architecture. PromptMoE enables dynamic, sparse, and task-adaptive aggregation of normal and abnormal semantic prompts, overcoming the limitations of single-prompt representations and improving both recognition and localization of unseen anomalous categories. Evaluated on 15 industrial and medical datasets, PromptMoE achieves state-of-the-art performance with substantial average improvements in detection AUC, demonstrating strong generalization and practical applicability.

📝 Abstract
Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose $\mathtt{PromptMoE}$. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, $\mathtt{PromptMoE}$ learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance. Our framework materializes this concept through a Visually-Guided Mixture of Prompts (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of $\mathtt{PromptMoE}$.
Problem

Research questions and friction points this paper is trying to address.

Overcoming limited generalization in zero-shot anomaly detection methods
Solving representational bottlenecks in vision-language prompt engineering
Addressing overfitting issues in current anomaly detection prompt strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns a pool of expert prompts as composable semantic primitives
Uses visually-guided Mixture-of-Experts for dynamic prompt combination
Employs image-gated sparse MoE to aggregate diverse expert states
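The aggregation step described above can be sketched as follows. This is a minimal, hypothetical illustration (the paper's actual gating network, expert count, and embedding dimensions are not specified here): an image embedding scores a pool of expert prompt embeddings, the top-k experts are selected, and their softmax-weighted combination yields one aggregated prompt representation.

```python
import numpy as np

def vgmop_aggregate(image_emb, expert_prompts, k=2):
    """Hypothetical sketch of image-gated sparse MoE prompt aggregation.

    image_emb:      (d,) image embedding used as the gating signal
    expert_prompts: (num_experts, d) pool of expert prompt embeddings
    k:              number of experts activated per instance (sparsity)
    """
    # Gating scores: similarity between the image and each expert prompt.
    scores = expert_prompts @ image_emb            # shape (num_experts,)
    # Sparse routing: keep only the top-k scoring experts.
    topk = np.argsort(scores)[-k:]
    # Softmax over the selected experts only (numerically stabilized).
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()
    # Weighted combination -> one aggregated prompt embedding.
    return w @ expert_prompts[topk]

# Toy usage with random embeddings (dimensions are illustrative).
rng = np.random.default_rng(0)
experts = rng.normal(size=(8, 16))   # 8 expert prompts, dim 16
img = rng.normal(size=16)
agg = vgmop_aggregate(img, experts, k=2)
```

In practice one such pool would exist for normal-state prompts and another for abnormal-state prompts, with the two aggregated embeddings compared against image features for detection and localization.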