🤖 AI Summary
Existing zero-shot anomaly detection (ZSAD) methods suffer from representation bottlenecks and overfitting on auxiliary data due to fixed or densely activated prompt strategies, leading to poor generalization to complex unseen anomalies. To address this, we propose Vision-Guided Mixture-of-Prompts (VGMoP), a novel framework that constructs a composable expert prompt pool and introduces a vision-gated sparse Mixture-of-Experts (MoE) architecture. VGMoP enables dynamic, sparse, and task-adaptive aggregation of normal and abnormal semantic prompts, effectively overcoming the limitations of single-prompt representations. This design significantly enhances both recognition and localization of unseen anomalous categories. Evaluated on 15 industrial and medical datasets, VGMoP achieves state-of-the-art performance, with substantial average improvements in detection AUC. The results demonstrate its superior generalization capability and practical applicability.
📝 Abstract
Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose $\mathtt{PromptMoE}$. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, $\mathtt{PromptMoE}$ learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance. Our framework materializes this concept through a Visually-Guided Mixture of Prompts (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of $\mathtt{PromptMoE}$.
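To make the image-gated sparse MoE idea concrete, here is a minimal NumPy sketch of the routing step the abstract describes: an image embedding scores a pool of expert prompt embeddings, only the top-k experts are activated, and their softmax-weighted combination yields an instance-adaptive prompt representation. All names, shapes, and the dot-product gate are illustrative assumptions, not the paper's actual implementation (where the gate and prompt pool would be learned jointly, with separate normal and abnormal pools).

```python
import numpy as np

def vision_gated_prompt_moe(image_feat, prompt_pool, k=2):
    """Hypothetical sketch of image-gated sparse prompt mixing.

    image_feat:  (d,) global image embedding driving the gate
    prompt_pool: (num_experts, d) expert prompt embeddings
    k:           number of experts activated per instance (sparsity)
    """
    scores = prompt_pool @ image_feat           # gating logits, one per expert
    topk = np.argsort(scores)[-k:]              # keep only the k best-matching experts
    masked = np.full_like(scores, -np.inf)      # non-selected experts get zero weight
    masked[topk] = scores[topk]
    weights = np.exp(masked - masked[topk].max())
    weights /= weights.sum()                    # softmax over the selected experts only
    mixed_prompt = weights @ prompt_pool        # sparse, instance-adaptive prompt embedding
    return mixed_prompt, weights
```

In the full framework this mixing would be applied separately to normal-state and abnormal-state prompt pools, and the resulting text embeddings compared against image features for detection and localization; the sketch only illustrates the sparse top-k routing itself.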