Simplifying Traffic Anomaly Detection with Video Foundation Models

📅 2025-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing traffic anomaly detection (TAD) methods heavily rely on complex, multi-stage or multimodal fusion architectures, yet the necessity of such architectural sophistication remains unvalidated. Method: This paper proposes a minimalist framework built on a plain Video Vision Transformer (ViT) encoder. It leverages self-supervised masked video modeling (MVM) with driving-scene-specific pretraining to learn spatiotemporal representations from unlabeled driving videos, followed by lightweight fine-tuning for downstream anomaly detection. Contribution/Results: The proposed approach achieves state-of-the-art (SOTA) or competitive performance across multiple benchmarks, while significantly improving inference speed and reducing parameter count by orders of magnitude. Crucially, this work provides the first systematic empirical validation that high-quality, domain-aware pretraining can obviate architectural complexity in TAD. It establishes a novel paradigm—“pretraining-first, architecture-minimal”—enabling efficient, scalable, and resource-light anomaly detection.
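To make the MVM pretraining idea concrete: one common recipe (VideoMAE-style "tube" masking) hides the same high fraction of spatial patches in every frame, and the encoder sees only the visible tokens while a light decoder reconstructs the hidden ones. The sketch below shows only the mask-sampling step; it is an illustrative assumption about the recipe, not the authors' code, and the grid sizes are arbitrary.

```python
import numpy as np

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Sample a 'tube' mask: the same spatial patches are hidden in
    every frame, so the model cannot copy pixels across time.
    Returns a boolean array of shape (num_frames, grid_h * grid_w);
    True = masked (hidden from the encoder)."""
    rng = np.random.default_rng(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial_mask = np.zeros(num_patches, dtype=bool)
    spatial_mask[masked_idx] = True
    # Broadcast the same spatial mask across all frames (the "tube").
    return np.broadcast_to(spatial_mask, (num_frames, num_patches)).copy()

# Example: 8 frames, a 14x14 patch grid, 90% of patches masked.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9)
visible = ~mask  # only these patch tokens would be fed to the ViT encoder
```

The high mask ratio is what makes the pretext task hard enough to force spatiotemporal reasoning; the exact ratio used by the paper is not stated here and would need to be checked in the released code.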

📝 Abstract
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) strong pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.
Problem

Research questions and friction points this paper is trying to address.

Investigates simple encoder-only models for Traffic Anomaly Detection
Evaluates impact of pre-training methods on detection performance
Explores domain-adaptive pre-training for improved anomaly detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simple encoder-only Video ViTs for TAD
Self-supervised Masked Video Modeling pre-training
Domain-Adaptive Pre-Training on driving videos
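At inference time, an encoder-only TAD model of this kind typically scores short clips with a sliding window and maps clip scores back to per-frame anomaly scores. The aggregation sketch below assumes a simple overlap-averaging rule; the window length, stride, and averaging scheme are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import numpy as np

def frame_scores_from_clips(clip_scores, clip_len, stride, num_frames):
    """Map per-clip anomaly scores (one score per sliding window)
    to per-frame scores by averaging all clips that cover each frame."""
    acc = np.zeros(num_frames)
    cnt = np.zeros(num_frames)
    for i, score in enumerate(clip_scores):
        start = i * stride
        end = min(start + clip_len, num_frames)
        acc[start:end] += score
        cnt[start:end] += 1
    cnt[cnt == 0] = 1  # frames covered by no clip keep score 0
    return acc / cnt

# Example: two overlapping 4-frame clips over a 6-frame video.
scores = frame_scores_from_clips([0.0, 1.0], clip_len=4, stride=2,
                                 num_frames=6)
```

Frames covered only by the first clip keep its score, frames covered only by the second keep hers, and the overlap gets the mean; this smoothing is one reason sliding-window evaluation is robust to clip-boundary effects.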