🤖 AI Summary
Existing monocular football video analysis methods for detecting ball-related events (e.g., passes, shots) rely heavily on costly manual annotations and lack explicit modeling of match context. To address this, we propose a semantic-aware spatiotemporal action detection framework. Our method explicitly encodes dynamic match state—including player positions, velocities, and team affiliations—as a temporal graph structure, which is jointly optimized end-to-end with visual features extracted via 3D convolutional networks. A graph neural network (GNN) models player interactions to capture tactical semantics. Experiments on real-world match footage demonstrate significant improvements in spatiotemporal action detection mean Average Precision (mAP), confirming that match-state information provides substantial predictive gain over vision-only baselines. This work establishes a new paradigm for low-cost, high-accuracy sports understanding by unifying geometric, kinematic, and strategic cues within a learnable graph-structured representation.
📝 Abstract
Soccer analytics rely on two data sources: the player positions on the pitch and the sequences of events they perform. With around 2000 ball events per game, their precise and exhaustive annotation based on a monocular video stream remains a tedious and costly manual task. While state-of-the-art spatio-temporal action detection methods show promise for automating this task, they lack contextual understanding of the game. Assuming professional players' behaviors are interdependent, we hypothesize that incorporating surrounding players' information such as positions, velocity and team membership can enhance purely visual predictions. We propose a spatio-temporal action detection approach that combines visual and game state information via Graph Neural Networks trained end-to-end with state-of-the-art 3D CNNs, demonstrating improved metrics through game state integration.