🤖 AI Summary
This work addresses the challenge of recognizing subtle actions, such as glances or nods, which are difficult to model due to their small amplitude, short duration, and high inter-class ambiguity. To this end, we propose B-MoE, a novel framework that introduces, for the first time, a body-part-aware Mixture-of-Experts (MoE) mechanism into micro-action recognition. B-MoE features a lightweight Macro-Micro Motion Encoder (M3E) within a two-stream architecture; expert modules partitioned by body region, combined with a cross-region attention-based routing mechanism, dynamically select and fuse local features from critical body parts with global motion cues. Evaluated on three benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction), B-MoE achieves state-of-the-art performance, with particularly significant gains in recognition accuracy for ambiguous, low-frequency, and low-amplitude action categories.
📝 Abstract
Micro-actions, fleeting and low-amplitude motions such as glances, nods, or minor posture shifts, carry rich social meaning but remain difficult for current action recognition models due to their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs) and is built on the lightweight Macro-Micro Motion Encoder (M3E), which captures both long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features, jointly capturing the spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with notable improvements on ambiguous, underrepresented, and low-amplitude classes.
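The abstract does not include implementation details, but the core idea of routing region-specific experts against a global motion cue can be sketched in a toy form. The snippet below is a minimal, hypothetical NumPy illustration, not the paper's actual B-MoE code: the region names follow the abstract, while `route_and_fuse`, the linear "experts", and all dimensions are illustrative assumptions.

```python
# Toy sketch of body-part-aware MoE routing (illustrative only, not the
# authors' implementation): one expert per body region, with an
# attention-style router that scores each region against a global query.
import numpy as np

REGIONS = ["head", "body", "upper_limbs", "lower_limbs"]
D = 8  # feature dimension, chosen arbitrarily for the sketch

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-in "experts": one small linear map per region.
experts = {r: rng.standard_normal((D, D)) * 0.1 for r in REGIONS}

def route_and_fuse(region_feats, global_feat, W_q, W_k):
    """Score regions against the global motion cue, then fuse the
    gate-weighted expert outputs with the global feature."""
    q = W_q @ global_feat                              # query from the global stream
    keys = {r: W_k @ f for r, f in region_feats.items()}
    scores = np.array([q @ keys[r] / np.sqrt(D) for r in REGIONS])
    gates = softmax(scores)                            # soft selection over regions
    local = sum(g * (experts[r] @ region_feats[r])
                for g, r in zip(gates, REGIONS))
    return local + global_feat, gates                  # residual-style fusion

# Random stand-ins for per-region and global features.
region_feats = {r: rng.standard_normal(D) for r in REGIONS}
global_feat = rng.standard_normal(D)
W_q = rng.standard_normal((D, D)) * 0.1
W_k = rng.standard_normal((D, D)) * 0.1

fused, gates = route_and_fuse(region_feats, global_feat, W_q, W_k)
print(dict(zip(REGIONS, gates.round(3))))  # gate weights form a distribution
```

The soft gating here is one plausible reading of "dynamically selects the most informative regions"; the actual routing in B-MoE (e.g. top-k selection, learned temperature, or multi-head attention) may differ.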