MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Micro-gesture recognition demands joint modeling of long-range spatiotemporal dependencies and fine-grained local motion patterns; however, CNNs struggle with long-range relationships, Transformers incur prohibitive computational overhead, and Mamba—a one-dimensional state space model (SSM)—lacks intrinsic local spatiotemporal awareness. To address this, we propose Motion-aware State Fusion Mamba (MSF-Mamba), the first SSM-based architecture incorporating Central Frame Differencing (CFD) for explicit motion representation, a Local Context State Fusion module for enhanced local spatiotemporal modeling, and a multi-scale adaptive weighting mechanism to capture dynamic features across scales. MSF-Mamba preserves Mamba's linear-time complexity while significantly improving micro-gesture modeling fidelity. Evaluated on two public micro-gesture benchmarks, it achieves state-of-the-art performance, consistently outperforming CNNs, Transformers, and existing SSM approaches.
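The Local Context State Fusion idea, as described above, augments Mamba's purely sequential state updates by mixing each state with its temporal neighbors. A minimal sketch of neighboring-state fusion as a small 1D kernel over the state sequence (the function name, shapes, and kernel are illustrative assumptions; the paper's module operates inside the SSM recurrence and is more involved):

```python
import numpy as np

def local_state_fusion(states: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Fuse each state with its temporal neighbors via a small 1D kernel.

    states: (T, D) sequence of per-step states; kernel: (K,), K odd.
    Hypothetical sketch: edge-padded so the output length matches the input.
    """
    r = len(kernel) // 2
    # Replicate boundary states so every position has a full neighborhood.
    padded = np.pad(states, ((r, r), (0, 0)), mode="edge")
    T = states.shape[0]
    out = np.zeros_like(states, dtype=float)
    for k, w in enumerate(kernel):
        out += w * padded[k:k + T]  # shifted, weighted copies of the sequence
    return out
```

With a smoothing kernel like `[0.25, 0.5, 0.25]`, each state becomes a weighted blend of itself and its immediate neighbors, giving the otherwise strictly causal 1D scan a notion of local temporal context.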

📝 Abstract
Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing. However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a design for motion awareness, which is crucial in MGR. To overcome these limitations, we propose Motion-aware State Fusion Mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ has been proposed. Specifically, MSF-Mamba+ supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba+ to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.
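The central frame difference (CFD) mentioned in the abstract is a standard way to expose motion explicitly: each frame's motion signal is the (scaled) difference between its next and previous frames. A minimal sketch under assumed conventions (edge-replication padding and a 1/2 scale factor; the paper's exact padding and normalization may differ):

```python
import numpy as np

def central_frame_difference(frames: np.ndarray) -> np.ndarray:
    """Central frame difference along the temporal axis.

    frames: (T, H, W, C) video clip. Returns an array of the same shape
    with D[t] = (frames[t+1] - frames[t-1]) / 2; the first and last
    frames are replicated so the output length matches the input.
    """
    padded = np.concatenate([frames[:1], frames, frames[-1:]], axis=0)
    return (padded[2:] - padded[:-2]) / 2.0
```

Unlike a forward difference `frames[t+1] - frames[t]`, the central form is symmetric around frame `t`, which tends to localize subtle motion cues more cleanly, one motivation for using it in fine-grained motion modeling.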
Problem

Research questions and friction points this paper is trying to address.

Enhancing Mamba for micro-gesture recognition with local spatiotemporal modeling
Addressing vanilla Mamba's inability to capture local spatiotemporal motion dependencies
Overcoming computational inefficiency in Transformer-based gesture recognition models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses local contextual states for spatiotemporal modeling
Introduces motion-aware module using central frame difference
Employs multiscale fusion with adaptive weighting mechanism
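The adaptive scale weighting described in the bullets above can be sketched as a learned softmax over per-scale fused states (the shapes, function names, and use of plain logits are illustrative assumptions, not the paper's exact module):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1D array of scale logits."""
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_scale_fusion(states: np.ndarray, logits: np.ndarray) -> np.ndarray:
    """Weighted fusion of per-scale states.

    states: (S, N, D) — S scales, N tokens, D channels (hypothetical layout).
    logits: (S,) learnable scores; softmax turns them into scale weights.
    Returns the (N, D) weighted sum over scales.
    """
    w = softmax(np.asarray(logits, dtype=float))      # (S,) weights, sum to 1
    return np.tensordot(w, states, axes=(0, 0))       # collapse the scale axis
```

In a trained model the logits would typically be produced by a small network conditioned on the input, so the relative emphasis on fine vs. coarse motion scales can vary per sample.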