One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes UniADet, a framework for universal visual anomaly detection that eliminates the need for complex prompt engineering and adaptation strategies, thereby improving flexibility and generalization. Notably, UniADet demonstrates for the first time that language encoders are unnecessary for zero-shot anomaly detection. By fully decoupling classification from segmentation and learning independent weights for cross-level features, the method introduces only 0.002M learnable parameters and requires no dataset-specific fine-tuning, achieving strong parameter efficiency and model-agnostic zero-shot performance. Evaluated across 14 real-world industrial and medical benchmarks, UniADet outperforms existing zero- and few-shot approaches by a large margin and, for the first time in this domain, even surpasses fully supervised methods.

📝 Abstract
Universal visual anomaly detection (AD) aims to identify anomaly images and segment anomaly regions in open and dynamic scenarios, following zero- and few-shot paradigms without any dataset-specific fine-tuning. Recent approaches have made significant progress by widely adopting visual-language foundation models. However, current methods often struggle with complex prompt engineering, elaborate adaptation modules, and challenging training strategies, ultimately limiting their flexibility and generality. To address these issues, this paper rethinks the fundamental mechanism behind visual-language models for AD and presents an embarrassingly simple, general, and effective framework for Universal vision Anomaly Detection (UniADet). Specifically, we first find that the language encoder is used to derive decision weights for anomaly classification and segmentation, and then demonstrate that it is unnecessary for universal AD. Second, we propose an embarrassingly simple method that completely decouples classification from segmentation and decouples cross-level features, i.e., learning independent weights for different tasks and hierarchical features. UniADet is highly simple (learning only decoupled weights), parameter-efficient (only 0.002M learnable parameters), general (adapting to a variety of foundation models), and effective (surpassing state-of-the-art zero-/few-shot methods by a large margin and, for the first time, even full-shot AD methods) on 14 real-world AD benchmarks covering both industrial and medical domains. We will make the code and model of UniADet available at https://github.com/gaobb/UniADet.
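To make the decoupling idea concrete, here is a minimal NumPy sketch of the mechanism the abstract describes: frozen multi-level features from a vision backbone, with small independent linear heads learned per task (classification vs. segmentation) and per feature level, replacing language-derived decision weights. The feature dimension, number of levels, patch count, and initialization are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, L, N = 64, 3, 16 * 16  # feature dim, feature levels, patches per level (assumed)

# Frozen multi-level patch features from some vision foundation model
# (random stand-ins here; in practice these come from the backbone).
feats = [rng.standard_normal((N, D)) for _ in range(L)]

# Decoupled learnable weights: one (D, 2) segmentation head per level,
# plus a separate head for image-level classification. With these toy
# sizes that is 3*64*2 + 64*2 = 512 parameters, illustrating why such
# a scheme can stay in the ~0.002M-parameter regime.
seg_heads = [rng.standard_normal((D, 2)) * 0.01 for _ in range(L)]
cls_head = rng.standard_normal((D, 2)) * 0.01

# Segmentation: per-patch normal/anomaly probability at each level,
# averaged across levels into one anomaly map.
anomaly_map = np.mean(
    [softmax(f @ w)[:, 1] for f, w in zip(feats, seg_heads)], axis=0
)

# Classification: pooled feature through its own independent head,
# fully decoupled from the segmentation heads.
pooled = np.mean(np.mean(feats, axis=0), axis=0)  # (D,)
image_score = softmax(pooled @ cls_head)[1]

print(anomaly_map.shape, float(image_score))
```

Since the heads are plain linear layers on frozen features, training them (e.g. with cross-entropy on the two-way scores) leaves the backbone untouched, which is what makes the approach model-agnostic across foundation models.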
Problem

Research questions and friction points this paper is trying to address.

universal visual anomaly detection
visual-language foundation models
prompt engineering
adaptation modules
training strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-free foundation model
universal anomaly detection
task decoupling
parameter-efficient learning
zero-shot AD