AI Summary
Anomaly detection typically relies on normal samples from a training set as reference, yet variations in appearance and position hinder cross-image alignment, limiting generalization. This paper proposes INP-Former, a universal anomaly detection method that extracts Intrinsic Normal Prototypes (INPs) directly from each test image rather than from external references, and uses these INPs to guide a decoder to reconstruct only normal regions, with reconstruction residuals serving as anomaly scores. The method introduces an INP Coherence Loss and a Soft Mining Loss; its extension, INP-Former++, adds a softened coherence constraint and residual learning, enabling adaptive performance across single-class, multi-class, semi-supervised, few-shot, and zero-shot settings. Built on a Transformer architecture with self-attention, linear prototype composition, and INP-guided reconstruction, it achieves new state-of-the-art results on MVTec-AD, VisA, and Real-IAD, demonstrating both high localization accuracy and strong generalization capability.
Abstract
Anomaly detection (AD) is essential for industrial inspection and medical diagnosis, yet existing methods typically rely on "comparing" test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be better aligned with the anomalies, since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure that the INPs faithfully represent normality for the test image. These INPs then guide the INP-guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability. Furthermore, we propose a soft version of the INP Coherence Loss and enhance INP-Former with residual learning, leading to INP-Former++. The proposed method significantly improves detection performance across single-class, multi-class, semi-supervised, few-shot, and zero-shot settings.
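The pipeline described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the encoder tokens and prototype queries are random stand-ins, the decoder is reduced to a nearest-prototype projection, and all shapes and names (`N`, `D`, `M`, `queries`) are hypothetical. It only shows the core ideas: INPs as linear combinations of the test image's own tokens, a coherence term pulling each token toward some prototype, reconstruction residuals as anomaly scores, and a soft weighting that emphasizes hard samples.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: N patch tokens of dimension D from a ViT-style
# encoder for ONE test image, and M prototypes.
rng = np.random.default_rng(0)
N, D, M = 196, 64, 6
tokens = rng.standard_normal((N, D))    # encoder tokens (stand-in)
queries = rng.standard_normal((M, D))   # learnable prototype queries (stand-in)

# INP extraction: each prototype is a convex (linear) combination of the
# image's own tokens, so normality is drawn from the test image itself.
attn = softmax(queries @ tokens.T / np.sqrt(D))   # (M, N), rows sum to 1
inps = attn @ tokens                              # (M, D) intrinsic normal prototypes

# Coherence term (sketch): every token should lie close to at least one INP.
dists = np.linalg.norm(tokens[:, None, :] - inps[None, :, :], axis=-1)  # (N, M)
coherence_loss = dists.min(axis=1).mean()

# INP-guided reconstruction (stand-in for the decoder): snap each token to
# its nearest prototype; the residual serves as the per-token anomaly score.
recon = inps[dists.argmin(axis=1)]                      # (N, D)
anomaly_scores = np.linalg.norm(tokens - recon, axis=-1)  # (N,)

# Soft mining (sketch): up-weight tokens that are hard to reconstruct.
weights = softmax(anomaly_scores)
soft_mining_loss = float((weights * anomaly_scores).sum())
```

In the actual method the decoder is a trained Transformer conditioned on the INPs, so anomalous tokens are reconstructed toward normality and their residuals grow large, while normal tokens reconstruct well and score low.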