🤖 AI Summary
This work addresses the challenge of predicting shot direction in soccer penalty kicks, where goalkeepers must anticipate the kicker’s intent within extremely limited time. The authors propose a low-latency early prediction method that takes spatiotemporal embeddings from a pretrained human action recognition (HAR) model and feeds them to a lightweight Mamba state-space model, which captures temporal dynamics from short video clips centered on ball contact. Contextual metadata, such as field side and the kicker’s dominant foot, are integrated as complementary cues that may reduce ambiguity. Notably, this is the first approach to combine pretrained HAR embeddings with the Mamba architecture for sports intention prediction, eliminating the need for handcrafted biomechanical features or explicit kinematic reconstruction and thereby improving practicality and generalization. Experiments report up to 53.1% accuracy on the three-class task and 64.5% on the two-class task, improving on or matching strong embedding baselines.
📝 Abstract
Penalty kicks in soccer are decided under extreme time constraints, where goalkeepers benefit from anticipating shot direction from the kicker’s motion before or around ball contact. This paper presents MambaKick, a learning-based framework for penalty direction prediction that extracts pretrained human action recognition (HAR) embeddings from short, contact-centered video segments and combines them with a lightweight temporal predictor. Rather than relying on explicit kinematic reconstruction or handcrafted biomechanical features, the approach reuses transferable spatiotemporal representations and uses selective state-space models (Mamba) for efficient sequence aggregation. Simple contextual metadata (e.g., field side and footedness) are also considered as complementary cues that may reduce ambiguity in real-world footage. Across a range of HAR backbones, MambaKick consistently improves on or matches strong embedding baselines, achieving up to 53.1% accuracy for three classes and 64.5% for two classes under the proposed methodology. Overall, the results indicate that combining pretrained HAR representations with efficient state-space temporal modeling is a practical direction for low-latency intention prediction in real-world sports video. The code will be available on GitHub: https://github.com/hvelesaca/MambaKick/
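The pipeline described in the abstract (per-frame HAR embeddings, a selective state-space aggregator, metadata fusion, and a classification head) can be illustrated with a small sketch. This is not the authors' implementation: it is a toy, pure-Python stand-in that assumes one hidden state per embedding channel, random untrained weights, random vectors in place of real HAR embeddings, and a hypothetical two-value metadata encoding. All function and parameter names (`selective_scan`, `predict_direction`, `random_params`) are invented for illustration.

```python
import math
import random

def softplus(v):
    # Numerically stable softplus; keeps the step size strictly positive.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def selective_scan(frames, decay, w_delta, w_in, w_out):
    """Toy diagonal selective scan (assumption: one state per channel).

    frames: T per-frame embeddings, each a list of D floats.
    decay: D positive decay rates; w_delta, w_in, w_out: D-dim projections.
    Returns the output at the last frame (length D) as a clip summary.
    """
    dim = len(frames[0])
    h = [0.0] * dim
    y = [0.0] * dim
    for x in frames:
        delta = softplus(dot(w_delta, x))  # input-dependent step size
        b = dot(w_in, x)                   # input-dependent input gate
        c = dot(w_out, x)                  # input-dependent output gate
        for d in range(dim):
            # Discretized recurrence: decay the old state, add the new input.
            h[d] = math.exp(-delta * decay[d]) * h[d] + delta * b * x[d]
            y[d] = c * h[d]
    return y

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_direction(frames, metadata, params):
    """Fuse the clip summary with metadata and return class probabilities."""
    summary = selective_scan(frames, params["decay"], params["w_delta"],
                             params["w_in"], params["w_out"])
    features = summary + metadata  # concatenate video and context cues
    logits = [dot(row, features) + bias
              for row, bias in zip(params["head_w"], params["head_b"])]
    return softmax(logits)

def random_params(dim, meta_dim, num_classes, seed=0):
    """Untrained, randomly initialized weights (illustration only)."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]

    return {
        "decay": [rng.uniform(0.5, 1.5) for _ in range(dim)],
        "w_delta": vec(dim), "w_in": vec(dim), "w_out": vec(dim),
        "head_w": [vec(dim + meta_dim) for _ in range(num_classes)],
        "head_b": [0.0] * num_classes,
    }

# Stand-in for 16 frames of 8-dimensional HAR embeddings.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
metadata = [1.0, 0.0]  # hypothetical encoding of field side and footedness
probs = predict_direction(frames, metadata, random_params(8, 2, 3))
print(probs)  # three class probabilities that sum to 1
```

The input-dependent step size and gates are what make the scan "selective": each frame decides how much past state to keep and how strongly it writes to the state. The cost stays linear in the number of frames, which is why this kind of recurrence suits low-latency prediction. A real system would replace the random vectors with embeddings from a pretrained HAR backbone and train the weights end to end.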