🤖 AI Summary
This work proposes a framework for unsupervised anomaly detection that addresses two limitations of existing methods built on DINOv3 features. First, current approaches often neglect the spatial and contextual dependencies among image patches. Second, they rely on non-parametric models of the distribution of normal (anomaly-free) features, which incurs high memory overhead. To overcome these issues, the authors introduce a two-dimensional autoregressive (AR) convolutional neural network that explicitly captures inter-patch spatial dependencies, incorporating spatial autoregressive modeling into DINOv3 embeddings for the first time. They also replace conventional memory banks and prototype clustering with a compact parametric probabilistic model. The resulting method achieves competitive anomaly detection performance on the BMAD medical imaging benchmark while significantly reducing both inference time and memory consumption.
📝 Abstract
DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from "normal" images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings sufficiently encode contextual information within each patch embedding. In addition, the normative distribution is often modeled with memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network, enabling fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work, including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: https://eerdil.github.io/spatial-ar-dinov3-uad/.
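To make the idea of 2D autoregressive modeling over a patch grid concrete, the following is a minimal sketch and not the authors' implementation. It assumes a raster-scan ordering, a single scalar feature per patch (real DINO embeddings are high-dimensional vectors), fixed rather than learned weights, and a squared prediction error as the anomaly score, which matches a unit-variance Gaussian negative log-likelihood up to constants. The function names `causal_mask`, `ar_predict`, and `anomaly_map` are hypothetical.

```python
def causal_mask(k):
    """Return a k x k mask (k odd) with 1 at positions that come strictly
    before the kernel center in raster-scan order, and 0 elsewhere."""
    c = k // 2
    return [[1 if (i < c or (i == c and j < c)) else 0 for j in range(k)]
            for i in range(k)]


def ar_predict(grid, weights):
    """Predict every cell from its causal neighbours only.

    grid:    H x W list of floats (one scalar feature per patch; a
             simplification assumed for this sketch).
    weights: k x k list of floats; entries outside the causal mask are ignored.
    Out-of-bounds neighbours are treated as zero (zero padding).
    """
    k = len(weights)
    c = k // 2
    mask = causal_mask(k)
    height, width = len(grid), len(grid[0])
    pred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            s = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - c, x + j - c
                    if mask[i][j] and 0 <= yy < height and 0 <= xx < width:
                        s += weights[i][j] * grid[yy][xx]
            pred[y][x] = s
    return pred


def anomaly_map(grid, weights):
    """Per-patch anomaly score: squared error between each patch feature and
    its prediction from causal neighbours (an assumed scoring rule)."""
    pred = ar_predict(grid, weights)
    return [[(g - p) ** 2 for g, p in zip(grow, prow)]
            for grow, prow in zip(grid, pred)]
```

In a trained model the weights would be fit on features from normal images so that normal patches are well predicted by their context and receive low scores. The only stored state is the kernel itself, which illustrates why a parametric AR model needs far less memory than a bank of stored embeddings, and why scoring a test image takes a single pass over the grid.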