EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition

📅 2024-08-10
🏛️ Neurocomputing
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost and poor real-time deployability of existing multimodal action recognition models, this paper proposes EPAM-Net, a pose-driven efficient multimodal network. Methodologically, it introduces a lightweight pose-guided attention mechanism into a Transformer-CNN hybrid backbone for fusing RGB, optical flow, and OpenPose skeletal features—marking the first such integration. It further designs a cross-modal pose alignment loss to enhance inter-modal consistency, and incorporates differentiable pose gating and multimodal contrastive distillation. Experiments demonstrate state-of-the-art efficiency-accuracy trade-offs: EPAM-Net achieves 82.7%, 98.1%, and 76.4% top-1 accuracy on Kinetics-400, UCF101, and HMDB51, respectively, while reducing parameter count by 37% and accelerating inference by 2.1× over comparable real-time models.

Technology Category

Application Category

Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost in video action recognition
Integrates pose and RGB data for efficient learning
Improves accuracy with reduced network parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

X-ShiftNet integrates TSM into 2D CNN
Skeleton features guide visual network stream
EPAM-Net reduces FLOPs and network parameters
🔎 Similar Papers
No similar papers found.
A
Ahmed Abdelkawy
Computer Vision and Image Processing Laboratory (CVIP), University of Louisville, Louisville, KY.
A
Asem Ali
Computer Vision and Image Processing Laboratory (CVIP), University of Louisville, Louisville, KY.
A
Aly A. Farag
Computer Vision and Image Processing Laboratory (CVIP), University of Louisville, Louisville, KY.