🤖 AI Summary
Problem: In spatio-temporal action detection (STAD) for soccer broadcast videos, detectors are run in a high-recall, low-precision regime to cover events exhaustively, and their lack of contextual modeling produces many false positives from ambiguous pixel-level cues. Method: We propose a structured sequence modeling approach built on a “soccer language” abstraction, in which a match is formalized as a sequence of actions guided by game state. We define a state-aware denoising sequence transduction task that enables joint spatio-temporal reasoning across players and teams. A Transformer encoder-decoder jointly processes noisy action predictions and clean game-state information, using extended temporal context to capture tactical regularities and inter-player dependencies. Contribution/Results: The framework improves both precision and recall in low-confidence regimes, narrowing the semantic gap left by purely pixel-based methods and complementing them. It offers a higher-level reasoning approach for sports video understanding that integrates domain knowledge into deep learning architectures.
📝 Abstract
State-of-the-art spatio-temporal action detection (STAD) methods show promising results for extracting soccer events from broadcast videos. However, when operated in the high-recall, low-precision regime required for exhaustive event coverage in soccer analytics, their lack of contextual understanding becomes apparent: many false positives could be resolved by considering a broader sequence of actions and game-state information. In this work, we address this limitation by reasoning at the game level and improving STAD through the addition of a denoising sequence transduction task. Sequences of noisy, context-free player-centric predictions are processed alongside clean game state information using a Transformer-based encoder-decoder model. By modeling extended temporal context and reasoning jointly over team-level dynamics, our method leverages the "language of soccer" (its tactical regularities and inter-player dependencies) to generate "denoised" sequences of actions. This approach improves both precision and recall in low-confidence regimes, enabling more reliable event extraction from broadcast video and complementing existing pixel-based methods.
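To make the input side of the denoising transduction task more concrete, the following is a minimal sketch of how noisy player-centric predictions and game-state information might be merged into one source token sequence for an encoder-decoder model. Everything here is an illustrative assumption rather than the authors' implementation: the `NoisyPrediction` and `GameState` types, the `ball_owner_team` field, the token strings, the four confidence buckets, and the `min_confidence` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NoisyPrediction:
    """One context-free detector output (hypothetical schema)."""
    frame: int         # frame index in the broadcast video
    player_id: int     # hypothetical tracking identifier
    action: str        # predicted action label, e.g. "pass"
    confidence: float  # detector score in [0, 1]


@dataclass
class GameState:
    """Clean game-state information at a frame (hypothetical schema)."""
    frame: int
    ball_owner_team: str  # assumed field; real game state is richer


def build_source_sequence(
    preds: List[NoisyPrediction],
    states: List[GameState],
    min_confidence: float = 0.1,  # assumed low threshold to keep recall high
) -> List[str]:
    """Interleave game-state and prediction tokens in temporal order."""
    tokens: List[Tuple[int, int, str]] = []
    for s in states:
        # Kind 0 sorts state tokens before predictions at the same frame.
        tokens.append((s.frame, 0, f"STATE:{s.ball_owner_team}"))
    for p in preds:
        if p.confidence < min_confidence:
            continue  # drop only the least plausible detections
        # Quantize confidence into 4 buckets (c0..c3) so it becomes a token.
        bucket = min(int(p.confidence * 4), 3)
        tokens.append((p.frame, 1, f"PRED:{p.action}:p{p.player_id}:c{bucket}"))
    tokens.sort()
    return [text for _, _, text in tokens]


preds = [
    NoisyPrediction(frame=10, player_id=7, action="pass", confidence=0.9),
    NoisyPrediction(frame=12, player_id=3, action="shot", confidence=0.15),
    NoisyPrediction(frame=5, player_id=2, action="tackle", confidence=0.05),
]
states = [GameState(frame=10, ball_owner_team="home")]
source = build_source_sequence(preds, states)
print(source)
```

In a setup like this, a sequence model would consume `source` and emit a cleaned action sequence, for example keeping the confident pass and deciding from context whether the low-confidence shot is real. The low threshold reflects the high-recall regime described above: weak detections stay in the input so that context, not a hard cutoff, decides their fate.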