SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot anomaly detection and segmentation, i.e., identifying anomalies in unseen categories without target-domain adaptation, by proposing SPG, a prompt-free framework. Departing from conventional prompting strategies, SPG introduces sparse projection guidance: leveraging frozen foundation models (e.g., DINOv3 or OpenCLIP ViT-L/14@336px), it learns sparse guide coefficients in two stages, first training a sparse autoencoder (SAE) on patch-token features and then optimizing only the guide coefficients to produce normal and anomaly guide vectors via the SAE dictionary. Because these coefficients activate only a small set of dictionary atoms, they expose both category-general and category-specific factors. Under cross-dataset zero-shot settings on MVTec AD and VisA, SPG achieves competitive image-level detection and strong pixel-level segmentation, attaining the highest pixel-level AUROC among the compared methods with the DINOv3 backbone.
📝 Abstract
We study zero-shot anomaly detection and segmentation using frozen foundation model features, where all learnable parameters are trained only on a labeled auxiliary dataset and deployed to unseen target categories without any target-domain adaptation. Existing prompt-based approaches use handcrafted or learned prompt embeddings as reference vectors for normal/anomalous states. We propose Sparse-Projected Guides (SPG), a prompt-free framework that learns sparse guide coefficients in the Sparse Autoencoder (SAE) latent space, which generate normal/anomaly guide vectors via the SAE dictionary. SPG employs a two-stage learning strategy on the labeled auxiliary dataset: (i) train an SAE on patch-token features, and (ii) optimize only guide coefficients using auxiliary pixel-level masks while freezing the backbone and SAE. On MVTec AD and VisA under cross-dataset zero-shot settings, SPG achieves competitive image-level detection and strong pixel-level segmentation; with DINOv3, SPG attains the highest pixel-level AUROC among the compared methods. We also report SPG instantiated with OpenCLIP (ViT-L/14@336px) to align the backbone with CLIP-based baselines. Moreover, the learned guide coefficients trace decisions back to a small set of dictionary atoms, revealing category-general and category-specific factors.
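The two-stage recipe in the abstract can be sketched with a toy NumPy example. Everything here is an assumption-laden illustration: the dictionary, the active atom indices, the weights, and the softmax scoring rule are invented stand-ins, not the paper's actual SAE, training losses, or anomaly score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the paper's ViT feature and SAE dictionary sizes are
# much larger. All names below are illustrative, not from the paper.
d, K = 16, 64  # patch-feature dim, number of SAE dictionary atoms

# Stage (i), assumed already done: an SAE trained on patch-token
# features yields a dictionary D of unit-norm atoms.
D = rng.standard_normal((K, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Stage (ii): with backbone and SAE frozen, only sparse guide
# coefficients are learned. A guide vector is a sparse combination
# of dictionary atoms, so each decision traces back to a few atoms.
def make_guide(atom_ids, weights):
    c = np.zeros(K)
    c[atom_ids] = weights  # only a handful of atoms are active
    return c @ D

g_normal = make_guide([3, 17, 40], [1.0, 0.8, 0.5])
g_anomaly = make_guide([7, 22], [1.0, 0.6])

def anomaly_prob(patch_feat):
    """Softmax over cosine similarities to the two guide vectors."""
    f = patch_feat / np.linalg.norm(patch_feat)
    sims = np.array([
        f @ (g_normal / np.linalg.norm(g_normal)),
        f @ (g_anomaly / np.linalg.norm(g_anomaly)),
    ])
    e = np.exp(sims)
    return e[1] / e.sum()  # probability mass on the "anomaly" guide

# A patch aligned with a guide receives the corresponding score.
normal_patch = g_normal + 0.1 * rng.standard_normal(d)
anomaly_patch = g_anomaly + 0.1 * rng.standard_normal(d)
print(anomaly_prob(anomaly_patch) > anomaly_prob(normal_patch))
```

Scoring every patch token this way would yield a segmentation map, with an image-level score obtained by pooling over patches; the sparse coefficients also make it easy to inspect which dictionary atoms drive each decision.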
Problem

Research questions and friction points this paper is trying to address.

zero-shot anomaly detection
anomaly segmentation
frozen foundation models
cross-dataset generalization
prompt-free framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse-Projected Guides
Sparse Autoencoder
Zero-Shot Anomaly Detection
Prompt-Free Framework
Dictionary-Based Interpretability
Tomoyasu Nanaumi
Yachiyo Engineering Co., Ltd., Japan
Yukino Tsuzuki
Yachiyo Engineering Co., Ltd., Japan
Junichi Okubo
Yachiyo Engineering Co., Ltd., Japan
Junichiro Fujii
Yachiyo Engineering Co., Ltd., Japan
Takayoshi Yamashita
Chubu University
Computer Vision, Pattern Recognition, object tracking, object detection, machine learning