🤖 AI Summary
This report presents the VAND 3.0 Challenge, which targets two core problems in real-world visual anomaly detection: robustness to distribution shifts and few-shot generalization. The challenge ran two tracks, one on anomaly detection methods that stay robust under real-world distribution shifts (Category 1) and one on the capabilities of Vision Language Models in the few-shot regime (Category 2). Participants' solutions improved significantly over previous baselines by combining or adapting existing approaches and fusing them with novel pipelines. In both tracks, large pre-trained vision and vision-language backbones (e.g., ViT, CLIP) accounted for much of the performance gain. The report also identifies efficient scaling as an open problem: future methods must meet the real-time and computational constraints of on-site deployment.
📝 Abstract
Visual anomaly detection is a strongly application-driven field of research. Consequently, the connection between academia and industry is of paramount importance. In this regard, we present the VAND 3.0 Challenge to showcase current progress in anomaly detection across different practical settings while addressing critical issues in the field. The challenge hosted two tracks: Category 1 fostered the development of anomaly detection methods robust against real-world distribution shifts, and Category 2 explored the capabilities of Vision Language Models in the few-shot regime. The participants' solutions achieved significant improvements over previous baselines by combining or adapting existing approaches and fusing them with novel pipelines. In both tracks, progress in large pre-trained vision (language) backbones played a pivotal role in the performance increase. However, future research needs to address how to scale anomaly detection methods more efficiently so that they meet on-site real-time and computational constraints.