Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitation of uniform feature transformation in zero-shot anomaly detection, which overlooks the inherent asymmetry between normal and anomalous data distributions. To this end, the authors propose AVA-DINO, a novel framework that introduces, for the first time, an anomaly-aware dual-branch adapter architecture built upon a frozen DINOv3 vision backbone to separately model normal and anomalous patterns. A text-guided routing mechanism, augmented with a routing regularization strategy to prevent degenerate solutions, enables context-dependent asymmetric feature activation. The framework dynamically fuses branch outputs using only input images and predefined textual prompts, without requiring task-specific fine-tuning. AVA-DINO achieves state-of-the-art performance across nine industrial and medical benchmarks, attaining a 93.5% image-level AUROC on MVTec-AD, and demonstrates strong cross-domain generalization capabilities in a zero-shot setting.

📝 Abstract

Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly even though their distributions are fundamentally asymmetric: normal samples are compact, while anomalies are diverse. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework in which two specialized branches, one for normal and one for anomalous patterns, adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and an explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, with 93.5% image-level AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. Code: https://github.com/aqeeelmirza/AVA-DINO
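To make the dual-branch idea in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes each branch is a small residual MLP, that routing weights come from a softmax over cosine similarities between patch features and two prompt embeddings ("normal" and "anomalous"), and that the regularizer is a generic entropy-based term borrowed from mixture-of-experts practice. The names `DualBranchAdapter` and `routing_regularizer`, the temperature value, and the layer sizes are all hypothetical; the abstract does not specify any of them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class DualBranchAdapter:
    """Hypothetical two-branch adapter applied to frozen backbone features."""

    def __init__(self, dim, hidden, rng):
        # Index 0: "normal" branch, index 1: "anomalous" branch (assumed layout).
        self.w1 = rng.normal(0.0, 0.02, size=(2, dim, hidden))
        self.w2 = rng.normal(0.0, 0.02, size=(2, hidden, dim))

    def branch(self, k, feats):
        # Residual MLP: the frozen features plus a learned correction.
        hidden = np.maximum(feats @ self.w1[k], 0.0)
        return feats + hidden @ self.w2[k]

    def forward(self, feats, text_emb, temperature=0.07):
        # feats: (N, dim) patch features; text_emb: (2, dim) prompt embeddings.
        sims = l2_normalize(feats) @ l2_normalize(text_emb).T       # (N, 2)
        weights = softmax(sims / temperature, axis=-1)              # (N, 2)
        outs = np.stack([self.branch(0, feats),
                         self.branch(1, feats)], axis=1)            # (N, 2, dim)
        fused = (weights[..., None] * outs).sum(axis=1)             # (N, dim)
        return fused, weights


def routing_regularizer(weights, eps=1e-8):
    # Assumed regularizer: lower is better.
    # Term 1: mean per-patch entropy, small when each patch routes decisively.
    per_patch = -(weights * np.log(weights + eps)).sum(axis=-1).mean()
    # Term 2: negative entropy of average branch usage, small when both
    # branches are used overall (discourages collapse onto one branch).
    usage = weights.mean(axis=0)
    balance = (usage * np.log(usage + eps)).sum()
    return per_patch + balance


rng = np.random.default_rng(0)
adapter = DualBranchAdapter(dim=8, hidden=4, rng=rng)
patch_feats = rng.normal(size=(5, 8))    # stand-in for frozen DINOv3 features
prompt_embs = rng.normal(size=(2, 8))    # stand-in for text prompt embeddings
fused, weights = adapter.forward(patch_feats, prompt_embs)
```

Under this assumed regularizer, uniform routing (every patch split 0.5/0.5) and collapse onto a single branch both score about 0, while decisive and balanced routing scores about -ln 2. That is one way to penalize the degenerate uniform routing the abstract mentions.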

Problem

Research questions and friction points this paper is trying to address.

zero-shot anomaly detection

asymmetric distributions

vision-language adaptation

normal-anomaly asymmetry

feature transformation

Innovation

Methods, ideas, or system contributions that make the work stand out.

anomaly-aware

vision-language adaptation

dual-branch specialization