Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a framework for unsupervised anomaly detection that addresses two limitations of existing methods built on DINOv3 features. First, current approaches often neglect the spatial and contextual dependencies among image patches. Second, they rely on non-parametric models of the normal distribution, which incur high memory and computational overhead at inference. To overcome these issues, the authors introduce a two-dimensional autoregressive (AR) convolutional neural network that explicitly captures inter-patch spatial dependencies, applying spatial autoregressive modeling to DINOv3 embeddings for the first time. They also replace conventional memory banks and prototype clustering with a compact parametric probabilistic model. The resulting method achieves competitive anomaly detection performance on the BMAD medical imaging benchmark while substantially reducing both inference time and memory consumption.

📝 Abstract
DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from "normal" images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings sufficiently encode contextual information within each patch embedding. In addition, the normative distribution is often modeled as memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network and enables fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: https://eerdil.github.io/spatial-ar-dinov3-uad/.
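
The idea in the abstract, predicting each patch embedding from the patches that precede it in raster order and scoring it by how unlikely it is, can be illustrated with a small sketch. This is not the authors' implementation: a ridge-regularized linear Gaussian predictor stands in for the AR CNN, plain arrays of shape (H, W, D) stand in for DINOv3 patch embeddings, and all function names (`causal_mask`, `causal_context`, `fit_linear_ar`, `anomaly_map`) are hypothetical.

```python
import numpy as np


def causal_mask(k: int) -> np.ndarray:
    """k x k boolean mask: True for kernel positions that come strictly
    before the center in raster order (rows above, then left in the same row)."""
    mask = np.zeros((k, k), dtype=bool)
    c = k // 2
    mask[:c, :] = True   # every row above the center
    mask[c, :c] = True   # same row, left of the center
    return mask


def causal_context(grid: np.ndarray, k: int = 3) -> np.ndarray:
    """Stack the embeddings of each patch's causal neighbors.

    grid: (H, W, D) patch embeddings. Returns (H*W, n_neighbors*D + 1),
    where the last column is a constant 1 used as a bias term.
    Neighbors outside the grid are zero-padded.
    """
    h, w, _ = grid.shape
    c = k // 2
    padded = np.pad(grid, ((c, c), (c, c), (0, 0)))
    neighbors = [padded[dy:dy + h, dx:dx + w]
                 for dy, dx in np.argwhere(causal_mask(k))]
    feats = np.concatenate(neighbors, axis=-1).reshape(h * w, -1)
    return np.hstack([feats, np.ones((h * w, 1))])


def fit_linear_ar(grids, k: int = 3, ridge: float = 1e-3):
    """Fit the stand-in parametric model on normal images only.

    Returns (weights, var): a linear map from causal context to the
    predicted embedding, and a per-dimension residual variance.
    """
    x = np.concatenate([causal_context(g, k) for g in grids])
    y = np.concatenate([g.reshape(-1, g.shape[-1]) for g in grids])
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    weights = np.linalg.solve(gram, x.T @ y)
    var = ((y - x @ weights) ** 2).mean(axis=0) + 1e-6
    return weights, var


def anomaly_map(grid: np.ndarray, weights, var, k: int = 3) -> np.ndarray:
    """Per-patch Gaussian negative log-likelihood, shape (H, W).

    One matrix product per image; no stored training features are needed.
    """
    h, w, d = grid.shape
    resid = grid.reshape(-1, d) - causal_context(grid, k) @ weights
    nll = 0.5 * (resid ** 2 / var + np.log(2 * np.pi * var)).sum(axis=1)
    return nll.reshape(h, w)
```

In this sketch, calling `fit_linear_ar` on a list of normal embedding grids and then `anomaly_map` on a test grid yields one score per patch, with higher values meaning the patch is less consistent with its causal neighborhood. Only `weights` and `var` are kept after fitting, which mirrors the memory argument made above: the model size does not grow with the number of training images, unlike a memory bank.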
Problem

Research questions and friction points this paper is trying to address.

unsupervised anomaly detection
spatial dependencies
patch embeddings
memory efficiency
computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial autoregressive modeling
DINOv3 embeddings
unsupervised anomaly detection
parametric normative model
memory-efficient inference
Ertunc Erdil
Postdoctoral Researcher, Computer Vision Laboratory, ETH Zurich
Machine Learning, Computer Vision, Medical Image Analysis
Nico Schulthess
Computer Vision Lab. ETH Zurich, Zurich, Switzerland
Guney Tombak
Computer Vision Lab. ETH Zurich, Zurich, Switzerland
Ender Konukoglu
ETH Zurich
Medical Image Analysis, Biophysical Modeling