🤖 AI Summary
To address the high computational cost and poor real-time deployability of existing multimodal action recognition models, this paper proposes EPAM-Net, an efficient pose-driven multimodal network. Methodologically, it introduces a lightweight pose-guided attention mechanism into a Transformer-CNN hybrid backbone to fuse RGB, optical flow, and OpenPose skeleton features, which the paper presents as the first such integration. It further designs a cross-modal pose alignment loss to enhance inter-modal consistency, and incorporates differentiable pose gating and multimodal contrastive distillation. Experiments demonstrate a state-of-the-art efficiency-accuracy trade-off: EPAM-Net achieves 82.7%, 98.1%, and 76.4% top-1 accuracy on Kinetics-400, UCF101, and HMDB51, respectively, while reducing the parameter count by 37% and accelerating inference by 2.1× relative to comparable real-time models.
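To make the core fusion idea more concrete, below is a minimal PyTorch-style sketch of what a pose-guided attention block could look like: pose-stream features produce a soft spatio-temporal gate that reweights the RGB-stream features. The module name `PoseGuidedAttention`, the tensor shapes, and the sigmoid-gated residual fusion are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hypothetical sketch of a pose-guided attention block (not the authors' code).
# Assumption: the pose and RGB streams yield feature maps of shape
# (batch, channels, time, height, width) that are spatially aligned, and the
# pose stream drives a spatio-temporal attention map over the RGB features.
import torch
import torch.nn as nn


class PoseGuidedAttention(nn.Module):
    def __init__(self, rgb_channels: int, pose_channels: int):
        super().__init__()
        # Project pose features down to a single-channel attention map.
        self.attn = nn.Sequential(
            nn.Conv3d(pose_channels, rgb_channels, kernel_size=1),
            nn.BatchNorm3d(rgb_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(rgb_channels, 1, kernel_size=1),
            nn.Sigmoid(),  # values in (0, 1) act as a soft spatio-temporal gate
        )

    def forward(self, rgb_feat: torch.Tensor, pose_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat:  (B, C_rgb,  T, H, W)
        # pose_feat: (B, C_pose, T, H, W)
        gate = self.attn(pose_feat)  # (B, 1, T, H, W)
        # Residual modulation: keep the original RGB signal and add back the
        # pose-emphasized regions on top of it.
        return rgb_feat * (1.0 + gate)


if __name__ == "__main__":
    block = PoseGuidedAttention(rgb_channels=64, pose_channels=32)
    rgb = torch.randn(2, 64, 8, 14, 14)
    pose = torch.randn(2, 32, 8, 14, 14)
    print(block(rgb, pose).shape)  # torch.Size([2, 64, 8, 14, 14])
```

The residual gating in this sketch keeps the RGB stream intact while emphasizing pose-salient regions; the losses described in the summary (cross-modal pose alignment, contrastive distillation) would be applied on top of such fused features during training.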