🤖 AI Summary
Single-view anomaly detection suffers from viewpoint bias, leading to inaccurate sample-level predictions. To address this, we propose a Multi-View Anomaly Detection (MVAD) framework centered on the Multi-View Adaptive Selection (MVAS) algorithm, enabling cross-view feature learning and fusion. Our key contributions include: (i) the first neighborhood-aware attention window mechanism for semantic correlation modeling, supporting dynamic window sizing and top-K sparsity pruning to achieve linear computational complexity; and (ii) the first unified optimization—under both one-class and multi-class settings—for joint anomaly localization at sample-, image-, and pixel-levels. The method comprises multi-view feature encoding, neighborhood window partitioning, cross-view semantic correlation matrix construction, and a lightweight fusion network. On Real-IAD, MVAD achieves state-of-the-art performance across all ten metrics: +4.1% (sample-level), +5.6% (image-level), and +6.7% (pixel-level) AUROC, with only 18M parameters—significantly reducing GPU memory footprint and training cost.
📝 Abstract
This study explores the recently proposed and challenging multi-view Anomaly Detection (AD) task. Single-view approaches suffer from blind spots relative to other viewpoints, resulting in inaccurate sample-level predictions. We therefore introduce the **M**ulti-**V**iew **A**nomaly **D**etection (**MVAD**) framework, which learns and integrates features across multiple views. Specifically, we propose a **M**ulti-**V**iew **A**daptive **S**election (**MVAS**) algorithm for cross-view feature learning and fusion. Feature maps are divided into neighbourhood attention windows, and a semantic correlation matrix is computed between each single-view window and the windows of all other views; attention is then conducted between each single-view window and its top-K most correlated multi-view windows. By adjusting the window sizes and top-K, the computational complexity can be reduced to linear. Extensive cross-setting (multi/single-class) experiments on the Real-IAD dataset validate the effectiveness of our approach, which achieves state-of-the-art performance across all ten metrics, with gains of **4.1%**↑ (sample level), **5.6%**↑ (image level), and **6.7%**↑ (pixel level), using only **18M** parameters and less GPU memory and training time.
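The window-selection step at the heart of MVAS can be sketched as follows. This is a minimal illustration under our own assumptions (mean-pooled window descriptors as the correlation signal, dot-product attention, and the `top_k` parameter), not the authors' implementation:

```python
import numpy as np

def mvas_topk_attention(windows, view_id, top_k=4):
    """Sketch of Multi-View Adaptive Selection (MVAS).

    windows: list of arrays, one per view, each (num_windows, tokens, dim)
    view_id: index of the query view
    top_k:   number of most-correlated cross-view windows to attend to
    """
    q_view = windows[view_id]                        # (Nq, T, D)
    # Gather windows from all *other* views.
    others = np.concatenate(
        [w for i, w in enumerate(windows) if i != view_id], axis=0
    )                                                # (Nk, T, D)

    # Window-level descriptors: mean-pool the tokens in each window
    # (an assumed choice; any pooled summary would do for the sketch).
    q_desc = q_view.mean(axis=1)                     # (Nq, D)
    k_desc = others.mean(axis=1)                     # (Nk, D)

    # Semantic correlation matrix between every query-view window
    # and every cross-view window.
    corr = q_desc @ k_desc.T                         # (Nq, Nk)

    # Keep only the top-K most correlated windows per query window;
    # this sparsity is what brings the attention cost down to linear.
    topk_idx = np.argsort(corr, axis=1)[:, -top_k:]  # (Nq, K)

    out = np.empty_like(q_view)
    d = q_view.shape[-1]
    for n in range(q_view.shape[0]):
        kv = others[topk_idx[n]].reshape(-1, d)      # (K*T, D)
        attn = q_view[n] @ kv.T / np.sqrt(d)         # (T, K*T)
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
        out[n] = attn @ kv                           # (T, D)
    return out
```

Because each query window attends to a fixed K windows rather than all Nk cross-view windows, the attention cost grows linearly with the number of windows, matching the complexity claim in the abstract.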