UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing anomaly detection methods often treat modalities and categories independently, leading to fragmented representations and high memory overhead; multi-class reconstruction approaches that rely on shared decoders struggle with cross-domain distribution shifts, distorting normality boundaries and inflating false-positive rates. This paper proposes UniMMAD, the first unified multi-modal and multi-class anomaly detection framework, built around a hierarchical MoE-in-MoE sparse gating architecture and a grouped dynamic filtering mechanism that enable modality- and class-aware, decoupled reconstruction. Following a "general-to-specific" paradigm, it jointly encodes varying modality combinations into compact general features and sparsely decompresses them into modality- and class-specific forms, cutting parameter usage by 75% while keeping activation sparse. Evaluated on nine benchmarks spanning three domains, twelve modalities, and sixty-six classes, UniMMAD achieves state-of-the-art performance, significantly reducing false positives while maintaining high efficiency and strong generalization.

📝 Abstract
Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a "general-to-specific" paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
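The sparsely-gated decompression described in the abstract can be sketched in a few lines. The NumPy sketch below is a minimal illustration, not UniMMAD's implementation: the linear experts, the additive modality/class conditioning of the gate, and the expert count and top-k value are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

class SparseMoE:
    """Minimal sparsely-gated MoE layer: only the top-k experts run per input.

    Illustrative sketch only; the real decoder's experts are not linear maps,
    and the gating/conditioning details here are assumptions.
    """

    def __init__(self, dim, n_experts=8, k=2):
        self.k = k
        # Each expert is a simple linear map (stand-in for a decoder branch).
        self.experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(n_experts)]
        # The gate maps the conditioned feature to one logit per expert.
        self.gate = rng.standard_normal((dim, n_experts)) / np.sqrt(dim)

    def __call__(self, x, cond):
        # Condition the gate on modality/class info via an additive embedding.
        logits = (x + cond) @ self.gate
        top = np.argsort(logits)[-self.k:]           # indices of the top-k experts
        weights = np.exp(logits[top] - logits[top].max())
        weights /= weights.sum()                      # softmax over selected experts only
        # Sparse activation: only the k selected experts are evaluated.
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, top))

dim = 16
moe = SparseMoE(dim)
x = rng.standard_normal(dim)      # compact "general" feature from the encoder
cond = rng.standard_normal(dim)   # hypothetical modality/class embedding
y = moe(x, cond)
print(y.shape)  # (16,)
```

Because the gate sees the modality/class condition, different inputs route through different expert subsets, which is what lets one decoder serve many modalities and classes without a fully shared path.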
Problem

Research questions and friction points this paper is trying to address.

Lack of a unified framework for multi-modal and multi-class anomaly detection
Fragmented, specialized AD solutions with excessive memory overhead
Domain interference and high false-alarm rates from shared reconstruction decoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

MoE-driven feature decompression for adaptive reconstruction
Sparsely-gated cross MoE for dynamic expert selection
Grouped dynamic filtering with MoE-in-MoE structure
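The MoE-in-MoE idea behind the last bullet can be sketched as two levels of sparse gating: an outer gate routes each input to an expert group, and an inner gate picks one expert within that group, so only one expert out of `n_groups * n_inner` is evaluated. This NumPy sketch assumes hard top-1 gating at both levels and linear experts; the group sizes and routing details are illustrative, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(1)

def top1(logits):
    """Hard top-1 selection (sparse gating at one level)."""
    return int(np.argmax(logits))

class MoEInMoE:
    """Two-level sparse gating sketch: outer gate -> expert group,
    inner gate -> one expert inside that group. A single expert runs
    per input, which is where the parameter/compute savings come from."""

    def __init__(self, dim, n_groups=4, n_inner=4):
        self.outer_gate = rng.standard_normal((dim, n_groups)) / np.sqrt(dim)
        self.inner_gates = rng.standard_normal((n_groups, dim, n_inner)) / np.sqrt(dim)
        self.experts = rng.standard_normal((n_groups, n_inner, dim, dim)) / np.sqrt(dim)

    def __call__(self, x):
        g = top1(x @ self.outer_gate)      # choose a group (e.g., per modality)
        e = top1(x @ self.inner_gates[g])  # choose an expert in it (e.g., per class)
        return x @ self.experts[g, e]      # evaluate only that one expert

dim = 16
layer = MoEInMoE(dim)
out = layer(rng.standard_normal(dim))
print(out.shape)  # (16,)
```

With 4 groups of 4 experts, a forward pass touches 1 of 16 expert weight matrices, consistent with the abstract's claim of sparse activation alongside a large reduction in active parameters.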