Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

📅 2025-05-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Real-time violence detection in surveillance video faces two challenges: long-range temporal dependencies are hard to model, and existing models are computationally inefficient. To address these, this paper proposes Dual-Branch VideoMamba, the first state-space model (SSM)-based approach for this task, which integrates CNNs' strong local spatial modeling with SSMs' efficient global temporal modeling. A gated token fusion mechanism decouples spatial and temporal representation learning and aggregates the resulting features adaptively. The paper also establishes the first benchmark with rigorous cross-dataset isolation, comprising RWF-2000, RLVS, and VioPeru. Experiments show that the method achieves state-of-the-art accuracy on this benchmark while improving inference speed by 42% and reducing FLOPs by 38%, outperforming mainstream architectures including ViT and TimeSformer.
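To make the efficiency argument for SSMs concrete, here is a minimal sketch of a plain linear state-space recurrence in pure Python. It is not Mamba's selective scan (where the parameters depend on the input) and is not taken from the paper; the function name `ssm_scan` and the diagonal parameters `a`, `b`, `c` are illustrative assumptions. The point is only that each time step costs a fixed amount of work, so the whole sequence is processed in time linear in its length.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear SSM over a sequence of scalar inputs.

    Hypothetical, time-invariant parameters (one entry per state dimension):
        h_t = a * h_{t-1} + b * x_t    (elementwise)
        y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)  # hidden state starts at zero
    ys = []
    for x in xs:  # one constant-cost update per time step
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# An impulse at t=0 decays geometrically through the state.
outputs = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0])
```

The hidden state carries information forward indefinitely, which is how such a recurrence can model long-range dependencies without comparing every pair of frames as self-attention does.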

📝 Abstract
The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture that combines a dual-branch design with a state-space model (SSM) backbone: one branch captures spatial features, the other focuses on temporal dynamics, and a gating mechanism fuses the two continuously. We also present a new benchmark for video violence detection that merges the RWF-2000, RLVS, and VioPeru datasets while ensuring strict separation between training and testing sets. Our model achieves state-of-the-art performance on this benchmark, offering an optimal balance between accuracy and computational efficiency and demonstrating the promise of SSMs for scalable, real-time surveillance violence detection.
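The abstract does not say how the strict train/test separation is enforced. One common way to avoid leakage when merging datasets is to assign every clip cut from the same source video to the same split. The sketch below assumes each clip record carries a source-video identifier; the record layout, `assign_split`, `merge_and_split`, and the 20% test fraction are all hypothetical and not the authors' procedure.

```python
import hashlib


def assign_split(source_key, test_fraction=0.2):
    """Deterministically map a source-video key to 'train' or 'test'."""
    digest = hashlib.sha256(source_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def merge_and_split(datasets, test_fraction=0.2):
    """Merge datasets while keeping clips from one source in one split.

    `datasets` maps a dataset name to (clip_path, source_id, label) tuples.
    This record layout is an assumption made for illustration.
    """
    splits = {"train": [], "test": []}
    for name, clips in datasets.items():
        for clip_path, source_id, label in clips:
            # Prefix with the dataset name so IDs cannot collide across sets.
            source_key = f"{name}/{source_id}"
            split = assign_split(source_key, test_fraction)
            splits[split].append((source_key, clip_path, label))
    return splits


# Toy records; the paths and IDs are made up.
toy = {
    "RWF-2000": [("a_0.avi", "a", 1), ("a_1.avi", "a", 1), ("b_0.avi", "b", 0)],
    "RLVS": [("v1.mp4", "v1", 1), ("v2.mp4", "v2", 0)],
}
splits = merge_and_split(toy)
```

Because the split is decided by hashing the source key rather than by shuffling clips, re-running the merge yields the same partition, and two clips from the same video can never land on opposite sides.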
Problem

Research questions and friction points this paper is trying to address.

Automated violence detection in surveillance videos
Overcoming the difficulty CNNs and Transformers have with long-term dependencies and computational efficiency
Achieving accuracy-efficiency balance for real-time detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch design: one branch captures spatial features, the other temporal dynamics
State-space model backbone handles long-term dependencies efficiently
Gated class token fusion continuously merges the two branches through a gating mechanism
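The page does not give the fusion formula, so the following is only a minimal sketch of one plausible form of gated class token fusion: a sigmoid gate, computed from both branches' class tokens, blends them per dimension. The function names, the shape of `weights`, and the convex-combination form are assumptions rather than the paper's definition. A single fusion step is shown, whereas the abstract describes the fusion as continuous.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_cls_fusion(cls_spatial, cls_temporal, weights, bias):
    """Blend two class tokens with a per-dimension gate (hypothetical form).

        gate_i  = sigmoid(weights[i] . concat(cls_spatial, cls_temporal) + bias[i])
        fused_i = gate_i * cls_spatial[i] + (1 - gate_i) * cls_temporal[i]
    """
    concat = list(cls_spatial) + list(cls_temporal)
    fused = []
    for i, (s, t) in enumerate(zip(cls_spatial, cls_temporal)):
        z = sum(w * v for w, v in zip(weights[i], concat)) + bias[i]
        g = sigmoid(z)
        fused.append(g * s + (1.0 - g) * t)
    return fused


# With zero weights and bias the gate is 0.5, so the result is a plain average.
zero_w = [[0.0] * 4 for _ in range(2)]
fused = gated_cls_fusion([1.0, 2.0], [3.0, 6.0], zero_w, [0.0, 0.0])
```

In a trained model the gate parameters would be learned, letting the network lean on the spatial token for some feature dimensions and on the temporal token for others.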