🤖 AI Summary
Micro-action recognition demands joint modeling of long-range spatiotemporal dependencies and fine-grained local motion patterns; however, CNNs struggle with long-range relationships, Transformers incur prohibitive computational overhead, and Mamba—a one-dimensional state space model (SSM)—lacks intrinsic local spatiotemporal awareness. To address this, we propose Motion-aware State Fusion Mamba (MSF-Mamba), the first SSM-based architecture incorporating Central Frame Differencing (CFD) for explicit motion representation, a Local Context State Fusion module for enhanced local spatiotemporal modeling, and a multi-scale adaptive weighting mechanism to capture dynamic features across scales. MSF-Mamba preserves Mamba’s linear-time complexity while significantly improving micro-action modeling fidelity. Evaluated on two public micro-gesture benchmarks, it achieves state-of-the-art performance, consistently outperforming CNNs, Transformers, and existing SSM approaches.
📝 Abstract
Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient alternative, leveraging state space models (SSMs) to enable linear-time processing. However, directly applying the vanilla Mamba to MGR may not be optimal: Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack motion-aware designs, which are crucial for MGR. To overcome these limitations, we propose Motion-aware State Fusion Mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, we propose a multiscale version named MSF-Mamba+: it supports multiscale motion-aware state fusion and adds an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba+ to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.
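The abstract does not give formulas for CFD or the adaptive scale weighting. As a rough, hedged illustration only, the two ideas might be sketched as below; the function names, the boundary clamping, and the softmax-based weighting are assumptions for exposition, not the paper's exact design:

```python
import math

def central_frame_difference(frames):
    """Per-frame motion cue: next-frame features minus previous-frame
    features, with clamped boundaries. A minimal sketch of CFD; the
    paper's actual formulation may differ."""
    T = len(frames)
    out = []
    for t in range(T):
        nxt = frames[min(t + 1, T - 1)]  # next frame (clamped at the end)
        prv = frames[max(t - 1, 0)]      # previous frame (clamped at the start)
        out.append([a - b for a, b in zip(nxt, prv)])
    return out

def adaptive_scale_fusion(states_per_scale, logits):
    """Softmax-weighted fusion of states from different scales.
    `logits` stands in for learned per-scale scores (hypothetical)."""
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]  # weights sum to 1 across scales
    fused = [0.0] * len(states_per_scale[0])
    for w, state in zip(weights, states_per_scale):
        fused = [f + w * s for f, s in zip(fused, state)]
    return fused
```

With two scales and equal logits, the fusion reduces to an average of the per-scale states; a learned weighting would instead emphasize whichever scale best matches the motion being recognized.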