Foundation Models and Transformers for Anomaly Detection: A Survey

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This survey systematically examines recent advances in vision anomaly detection (VAD) leveraging Transformers and foundation models, addressing core challenges: modeling long-range dependencies, insufficient contextual awareness, and scarcity of annotated data. It categorizes prevailing paradigms into reconstruction-based, feature-mapping-based, and zero-/few-shot learning approaches, highlighting how foundation models drive a paradigm shift via global self-attention mechanisms, large-scale self-supervised pretraining, and unified representation spaces. Compared to conventional methods, these approaches substantially improve detection robustness, cross-domain generalization, and interpretability, while enabling deployment under low-resource constraints. The paper establishes a structured benchmark by summarizing state-of-the-art performance boundaries, identifying common limitations—including computational overhead and ambiguous anomaly definitions—and proposing future directions such as multimodal alignment and integration of causal reasoning. It thus provides both empirical guidance and theoretical grounding for advancing VAD research.
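The reconstruction-based paradigm named above scores an input by how poorly a model trained only on normal data can reconstruct it: anomalies fall outside the learned normal manifold, so their reconstruction error is large. A minimal illustrative sketch of that scoring idea, using a linear PCA reconstructor as a stand-in for the Transformer autoencoders the survey covers (all names and data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "normal" data: samples confined to a 2-D subspace of R^16.
basis = rng.normal(size=(2, 16))
normal = rng.normal(size=(200, 2)) @ basis

# Fit a linear reconstructor (PCA) on normal data only.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]  # top-2 principal directions span the normal subspace

def anomaly_score(x):
    """Reconstruction error: near zero on the normal manifold, large off it."""
    z = (x - mean) @ components.T   # encode
    recon = z @ components + mean   # decode
    return float(np.linalg.norm(x - recon))

normal_test = rng.normal(size=2) @ basis   # lies in the normal subspace
anomalous = rng.normal(size=16)            # generic point off the subspace
```

Here `anomaly_score(anomalous)` exceeds `anomaly_score(normal_test)`; real VAD systems apply the same logic with learned nonlinear encoders and per-pixel error maps for localization.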

📝 Abstract
In line with the development of deep learning, this survey examines the transformative role of Transformers and foundation models in advancing visual anomaly detection (VAD). We explore how these architectures, with their global receptive fields and adaptability, address challenges such as long-range dependency modeling, contextual modeling, and data scarcity. The survey categorizes VAD methods into reconstruction-based, feature-based, and zero/few-shot approaches, highlighting the paradigm shift brought about by foundation models. By integrating attention mechanisms and leveraging large-scale pre-training, Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions. This work provides a comprehensive review of state-of-the-art techniques, their strengths, limitations, and emerging trends in leveraging these architectures for VAD.
Problem

Research questions and friction points this paper is trying to address.

Advancing visual anomaly detection using Transformers and foundation models
Addressing challenges like long-range dependency modeling and data scarcity
Reviewing state-of-the-art techniques and emerging trends in VAD
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformers enhance anomaly detection via attention mechanisms
Foundation models enable robust zero/few-shot VAD approaches
Large-scale pre-training improves interpretability and scalability
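The zero/few-shot paradigm noted above typically compares an image embedding against text embeddings of prompts such as "a photo of a normal object" versus "a photo of a damaged object" in a shared vision-language space, as in CLIP-based detectors. A toy sketch of that comparison; the embedding vectors below are made-up stand-ins for a real pretrained encoder:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_anomaly_score(image_emb, normal_emb, anomalous_emb):
    """Softmax over image-text similarities; returns P(anomalous)."""
    sims = np.array([cosine(image_emb, normal_emb),
                     cosine(image_emb, anomalous_emb)])
    e = np.exp(sims * 100.0)  # sharp temperature, in the spirit of CLIP's logit scale
    return float(e[1] / e.sum())

# Hypothetical embeddings; a real system would obtain these from a
# pretrained vision-language model rather than hand-written vectors.
normal_emb = np.array([1.0, 0.0, 0.0])      # text: "a photo of a normal object"
anomalous_emb = np.array([0.0, 1.0, 0.0])   # text: "a photo of a damaged object"
img_good = np.array([0.9, 0.1, 0.1])        # image resembling the normal prompt
img_bad = np.array([0.2, 0.95, 0.1])        # image resembling the damaged prompt
```

No anomaly examples are needed at training time: the "definition" of anomaly is carried entirely by the text prompts, which is what enables deployment under the low-resource constraints the summary highlights.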