Simplifying Traffic Anomaly Detection with Video Foundation Models

📅 2025-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing traffic anomaly detection (TAD) methods heavily rely on complex, multi-stage or multimodal fusion architectures, yet the necessity of such architectural sophistication remains unvalidated. Method: This paper proposes a minimalist framework built on a plain Video Vision Transformer (ViT) encoder. It leverages self-supervised masked video modeling (MVM) with driving-scene-specific pretraining to learn spatiotemporal representations from unlabeled driving videos, followed by lightweight fine-tuning for downstream anomaly detection. Contribution/Results: The proposed approach achieves state-of-the-art (SOTA) or competitive performance across multiple benchmarks, while significantly improving inference speed and reducing parameter count by orders of magnitude. Crucially, this work provides the first systematic empirical validation that high-quality, domain-aware pretraining can obviate architectural complexity in TAD. It establishes a novel paradigm—“pretraining-first, architecture-minimal”—enabling efficient, scalable, and resource-light anomaly detection.
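To make the MVM pretraining idea concrete: one common recipe (VideoMAE-style "tube" masking) hides the same high fraction of spatial patches in every frame, and the encoder sees only the visible tokens while a light decoder reconstructs the hidden ones. The sketch below shows only the mask-sampling step; it is an illustrative assumption about the recipe, not the authors' code, and the grid sizes are arbitrary.

```python
import numpy as np

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Sample a 'tube' mask: the same spatial patches are hidden in
    every frame, so the model cannot copy pixels across time.
    Returns a boolean array of shape (num_frames, grid_h * grid_w);
    True = masked (hidden from the encoder)."""
    rng = np.random.default_rng(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial_mask = np.zeros(num_patches, dtype=bool)
    spatial_mask[masked_idx] = True
    # Broadcast the same spatial mask across all frames (the "tube").
    return np.broadcast_to(spatial_mask, (num_frames, num_patches)).copy()

# Example: 8 frames, a 14x14 patch grid, 90% of patches masked.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9)
visible = ~mask  # only these patch tokens would be fed to the ViT encoder
```

The high mask ratio is what makes the pretext task hard enough to force spatiotemporal reasoning; the exact ratio used by the paper is not stated here and would need to be checked in the released code.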

📝 Abstract
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) strong pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.
Problem

Research questions and friction points this paper is trying to address.

Investigates simple encoder-only models for Traffic Anomaly Detection
Evaluates impact of pre-training methods on detection performance
Explores domain-adaptive pre-training for improved anomaly detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simple encoder-only Video ViTs for TAD
Self-supervised Masked Video Modeling pre-training
Domain-Adaptive Pre-Training on driving videos
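At inference time, an encoder-only TAD model of this kind typically scores short clips with a sliding window and maps clip scores back to per-frame anomaly scores. The aggregation sketch below assumes a simple overlap-averaging rule; the window length, stride, and averaging scheme are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import numpy as np

def frame_scores_from_clips(clip_scores, clip_len, stride, num_frames):
    """Map per-clip anomaly scores (one score per sliding window)
    to per-frame scores by averaging all clips that cover each frame."""
    acc = np.zeros(num_frames)
    cnt = np.zeros(num_frames)
    for i, score in enumerate(clip_scores):
        start = i * stride
        end = min(start + clip_len, num_frames)
        acc[start:end] += score
        cnt[start:end] += 1
    cnt[cnt == 0] = 1  # frames covered by no clip keep score 0
    return acc / cnt

# Example: two overlapping 4-frame clips over a 6-frame video.
scores = frame_scores_from_clips([0.0, 1.0], clip_len=4, stride=2,
                                 num_frames=6)
```

Frames covered only by the first clip keep its score, frames covered only by the second keep hers, and the overlap gets the mean; this smoothing is one reason sliding-window evaluation is robust to clip-boundary effects.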