Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

📅 2025-05-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Real-time violent action detection in surveillance videos faces dual challenges: difficulty in modeling long-range temporal dependencies and low computational efficiency. To address these, this paper proposes the first state-space model (SSM)-based approach for this task—Dual-Branch VideoMamba—which synergistically integrates CNNs’ strong local spatial modeling with SSMs’ efficient global temporal modeling. A novel gated token fusion mechanism is introduced to decouple spatiotemporal representation learning and enable adaptive feature aggregation. We further establish the first rigorously cross-dataset-isolated benchmark, comprising RWF-2000, RLVS, and VioPeru. Extensive experiments demonstrate that our method achieves state-of-the-art accuracy on this benchmark while improving inference speed by 42% and reducing FLOPs by 38%, significantly outperforming mainstream architectures including ViT and TimeSformer.

Technology Category

Application Category

📝 Abstract
The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone where one branch captures spatial features, while the other focuses on temporal dynamics, with continuous fusion via a gating mechanism. We also present a new benchmark by merging RWF-2000, RLVS, and VioPeru datasets in video violence detection, ensuring strict separation between training and testing sets. Our model achieves state-of-the-art performance on this benchmark offering an optimal balance between accuracy and computational efficiency, demonstrating the promise of SSMs for scalable, real-time surveillance violence detection.
Problem

Research questions and friction points this paper is trying to address.

Automated violence detection in surveillance videos
Overcoming long-term dependency and computational limitations
Achieving accuracy-efficiency balance for real-time detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual branch design captures spatial and temporal features
State-space model backbone handles long-term dependencies efficiently
Gated token fusion mechanism enhances violence detection accuracy
🔎 Similar Papers
No similar papers found.
D
D. C. Senadeera
School of Electronic Engineering and Computer Science, Queen Mary University of London, UK
Xiaoyun Yang
Xiaoyun Yang
Remark AI UK Limited
Medical ImagingMachine LearningComputer Vision
Dimitrios Kollias
Dimitrios Kollias
Associate Professor in Multimodal AI at Queen Mary University of London
Multimodal AIDeep Learning & Computer VisionBehavior AnalysisHMIMedical Imaging & Healthcare
G
Gregory G. Slabaugh
School of Electronic Engineering and Computer Science, Queen Mary University of London, UK