🤖 AI Summary
This work addresses athlete fall classification in figure skating, a fine-grained action classification task. It proposes Gate-Shift-Pose, an extension of Gate-Shift-Fuse networks that combines RGB frames with skeleton pose data. Two fusion strategies are evaluated: early fusion, which concatenates Gaussian heatmaps of pose keypoints with the RGB frames at the input stage, and late fusion, which uses a multi-stream architecture with attention mechanisms to combine RGB and pose features. On the FR-FS dataset, the model improves accuracy over the RGB-only baseline by up to 40% with ResNet18 and 20% with ResNet50. Early fusion with ResNet50 achieves the highest accuracy (98.08%), while late fusion is better suited to lighter backbones such as ResNet18. The results underline the role of skeleton pose information in recognizing complex on-ice movements.
📝 Abstract
This paper introduces Gate-Shift-Pose, an enhanced version of Gate-Shift-Fuse networks, designed for athlete fall classification in figure skating by integrating skeleton pose data alongside RGB frames. We evaluate two fusion strategies: early-fusion, which combines RGB frames with Gaussian heatmaps of pose keypoints at the input stage, and late-fusion, which employs a multi-stream architecture with attention mechanisms to combine RGB and pose features. Experiments on the FR-FS dataset demonstrate that Gate-Shift-Pose significantly outperforms the RGB-only baseline, improving accuracy by up to 40% with ResNet18 and 20% with ResNet50. Early-fusion achieves the highest accuracy (98.08%) with ResNet50, leveraging the model's capacity for effective multimodal integration, while late-fusion is better suited for lighter backbones like ResNet18. These results highlight the potential of multimodal architectures for sports action recognition and the critical role of skeleton pose information in capturing complex motion patterns.
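The two fusion strategies described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`keypoint_heatmaps`, `early_fuse`, `late_fuse`), the Gaussian width `sigma`, the toy keypoints, and the scoring vectors are all hypothetical, and the softmax-weighted sum is only a simple stand-in for the attention mechanism the abstract mentions.

```python
import numpy as np


def keypoint_heatmaps(keypoints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per (x, y) keypoint; returns (K, H, W)."""
    ys, xs = np.mgrid[0:height, 0:width]  # row and column index grids, each (H, W)
    maps = [
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        for x, y in keypoints
    ]
    return np.stack(maps).astype(np.float32)


def early_fuse(rgb, heatmaps):
    """Early fusion: concatenate RGB (3, H, W) and heatmaps (K, H, W) on channels."""
    return np.concatenate([rgb, heatmaps], axis=0)


def late_fuse(rgb_feat, pose_feat, w_rgb, w_pose):
    """Late fusion stand-in: softmax-weighted sum of the two stream features."""
    # One scalar score per stream (assumed scoring rule, not from the paper).
    scores = np.array([rgb_feat @ w_rgb, pose_feat @ w_pose])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights[0] * rgb_feat + weights[1] * pose_feat, weights


# Toy inputs: one 3x32x32 frame and three hypothetical keypoints given as (x, y).
rng = np.random.default_rng(0)
frame = rng.random((3, 32, 32), dtype=np.float32)
pose = [(8, 10), (16, 16), (24, 20)]

heat = keypoint_heatmaps(pose, 32, 32)
fused_input = early_fuse(frame, heat)  # channel count becomes 3 + K

# Toy per-stream feature vectors, as if produced by two separate backbones.
rgb_feat = rng.standard_normal(16)
pose_feat = rng.standard_normal(16)
fused_feat, weights = late_fuse(
    rgb_feat, pose_feat, rng.standard_normal(16), rng.standard_normal(16)
)
```

In the early-fusion case the backbone's first convolution would need to accept `3 + K` input channels instead of 3, whereas the late-fusion case leaves each backbone unchanged and merges their outputs afterwards.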