MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection

📅 2025-01-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To jointly model actions at diverse temporal scales in hour-long untrimmed videos while remaining computationally efficient, this paper adapts the Mamba architecture to temporal action detection. The proposed Multi-scale Temporal Mamba (MS-Temba) consists of Temporal Mamba (Temba) Blocks and a Temba Fuser. Each Temba Block pairs a Temporal Local Module (TLM) for short-range modeling with a Dilated Temporal SSM (DTS), which uses dilation to expand the temporal receptive field and capture long-range inter-segment dependencies; together they extract local and global features at multiple temporal scales, which the Temba Fuser aggregates into a comprehensive multi-scale representation. On three public benchmarks, MS-Temba outperforms state-of-the-art methods on long videos and matches prior methods on short videos, while using only one-eighth of the parameters of the best Transformer-based approach and achieving substantially higher throughput, improving deployment feasibility.
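The payoff of dilation is easy to quantify: stacking dilated convolutions grows the temporal receptive field with the sum of the dilations rather than with depth alone. A small sketch of that arithmetic (the kernel size and doubling dilation schedule here are illustrative assumptions, not the paper's exact configuration):

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Temporal receptive field (in frames) of stacked dilated 1-D convs.

    A conv with kernel K and dilation d adds (K - 1) * d frames of context,
    so a stack covers 1 + (K - 1) * sum(dilations).
    """
    return 1 + (kernel_size - 1) * sum(dilations)

# With kernel 3 and dilations doubling per layer (assumed schedule),
# four layers span 31 frames, versus 9 for the same undilated stack.
print(receptive_field(3, [1, 2, 4, 8]))   # dilated: 31
print(receptive_field(3, [1, 1, 1, 1]))   # undilated: 9
```

The same parameter count thus covers a much longer temporal context, which is the efficiency argument behind DTS.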

πŸ“ Abstract
Action detection in real-world scenarios is particularly challenging due to densely distributed actions in hour-long untrimmed videos. It requires modeling both short- and long-term temporal relationships while handling significant intra-class temporal variations. Previous state-of-the-art (SOTA) Transformer-based architectures, though effective, are impractical for real-world deployment due to their high parameter count, GPU memory usage, and limited throughput, making them unsuitable for very long videos. In this work, we innovatively adapt the Mamba architecture for action detection and propose Multi-scale Temporal Mamba (MS-Temba), comprising two key components: Temporal Mamba (Temba) Blocks and the Temporal Mamba Fuser. Temba Blocks include the Temporal Local Module (TLM) for short-range temporal modeling and the Dilated Temporal SSM (DTS) for long-range dependencies. By introducing dilations, a novel concept for Mamba, TLM and DTS capture local and global features at multiple scales. The Temba Fuser aggregates these scale-specific features using Mamba to learn comprehensive multi-scale representations of untrimmed videos. MS-Temba is validated on three public datasets, outperforming SOTA methods on long videos and matching prior methods on short videos while using only one-eighth of the parameters.
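The multi-scale structure described in the abstract can be sketched at a very high level. This is a minimal NumPy illustration of the idea only: the state-space (Mamba) recurrence of DTS is replaced by a depthwise dilated-convolution stand-in, and the Temba Fuser is reduced to a mean over scales; none of the module names below reflect the authors' actual implementation.

```python
import numpy as np

def dilated_depthwise_conv1d(x, w, dilation):
    """Stand-in for DTS: 'same'-padded depthwise dilated conv over time.

    x: (T, C) per-frame features; w: (K, C) depthwise kernel (K odd).
    """
    T, C = x.shape
    K = w.shape[0]
    pad = (K - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            out[t] += w[k] * xp[t + k * dilation]
    return out

def ms_temba_sketch(x, n_scales=3, kernel_size=3, seed=0):
    """Run one branch per scale with doubling dilation, then fuse.

    The real Temba Fuser uses Mamba to aggregate scale-specific
    features; here a mean over scales stands in for it.
    """
    rng = np.random.default_rng(seed)
    branches = []
    for s in range(n_scales):
        dilation = 2 ** s  # dilation doubles per scale (assumption)
        w = rng.standard_normal((kernel_size, x.shape[1])) / kernel_size
        branches.append(dilated_depthwise_conv1d(x, w, dilation))
    return np.mean(branches, axis=0)
```

The point of the sketch is the shape of the computation, not its contents: each scale sees the same frame sequence through a different dilation, and fusion happens once over the per-scale outputs.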
Problem

Research questions and friction points this paper is trying to address.

Action Recognition
Long Videos
Transformer Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Scale Temporal Mamba (MS-Temba)
Temporal Local Module (TLM)
Dilated Temporal SSM (DTS)
🔎 Similar Papers
No similar papers found.