Beyond Pixels: Leveraging the Language of Soccer to Improve Spatio-Temporal Action Detection in Broadcast Videos

📅 2025-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
In spatiotemporal action detection (STAD) for soccer broadcast videos, insufficient contextual modeling leads to high recall but low precision—particularly due to false positives arising from ambiguous pixel-level cues. Method: We propose a structured sequence modeling approach grounded in a novel “soccer language” abstraction, where matches are formalized as game-state-guided sequences. We design a state-aware denoising sequence transduction task enabling joint spatiotemporal reasoning across players and teams. Our method employs a Transformer encoder-decoder architecture that jointly models action predictions and multidimensional game states, augmented with extended temporal context and explicit tactical dependency modeling. Contribution/Results: The framework significantly improves both precision and recall in low-confidence regions, effectively bridging the semantic gap inherent in purely pixel-based methods. It establishes an interpretable, generalizable high-level reasoning paradigm for sports video understanding—demonstrating principled integration of domain knowledge into deep learning architectures.
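The summary's "soccer language" abstraction formalizes a match as a game-state-guided token sequence. The paper does not publish its exact vocabulary, so the sketch below is a hypothetical serialization: per-player detections from a pixel-based STAD model, ordered in time and prefixed with a game-state token, as one plausible input to the sequence transduction task. All names (`NoisyDetection`, `to_soccer_tokens`, the token format) are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class NoisyDetection:
    """A context-free, player-centric prediction from a pixel-based STAD model."""
    t: float           # timestamp in seconds
    player_id: int
    action: str        # e.g. "pass", "shot", "tackle"
    confidence: float

def to_soccer_tokens(detections, possession_team):
    """Serialize a window of noisy detections plus clean game-state
    information into one token sequence -- a hypothetical instance of
    the paper's game-state-guided 'soccer language'."""
    tokens = [f"<possession:{possession_team}>"]  # game-state prefix token
    for d in sorted(detections, key=lambda d: d.t):
        tokens.append(f"<p{d.player_id}:{d.action}>")
    return tokens
```

A sequence like this would be the noisy source side of the denoising transduction task; the target side would be the corresponding clean action sequence.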

📝 Abstract
State-of-the-art spatio-temporal action detection (STAD) methods show promising results for extracting soccer events from broadcast videos. However, when operated in the high-recall, low-precision regime required for exhaustive event coverage in soccer analytics, their lack of contextual understanding becomes apparent: many false positives could be resolved by considering a broader sequence of actions and game-state information. In this work, we address this limitation by reasoning at the game level and improving STAD through the addition of a denoising sequence transduction task. Sequences of noisy, context-free player-centric predictions are processed alongside clean game state information using a Transformer-based encoder-decoder model. By modeling extended temporal context and reasoning jointly over team-level dynamics, our method leverages the "language of soccer" - its tactical regularities and inter-player dependencies - to generate "denoised" sequences of actions. This approach improves both precision and recall in low-confidence regimes, enabling more reliable event extraction from broadcast video and complementing existing pixel-based methods.
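The abstract's key claim is that many false positives are resolvable from context: a low-confidence detection becomes believable (or not) given the actions around it. The paper does this with a learned Transformer encoder-decoder; the toy filter below is only a rule-based stand-in that makes the idea concrete - a low-confidence "shot" is kept only when a plausible consequence follows shortly after. The function name, labels, threshold, and window are all illustrative assumptions.

```python
def denoise(actions, window=2.0):
    """Toy contextual filter over (timestamp, label, confidence) triples.
    A rule-based stand-in for the paper's learned sequence transducer:
    a low-confidence 'shot' survives only if corroborated by a plausible
    follow-up event within `window` seconds."""
    kept = []
    for i, (t, label, conf) in enumerate(actions):
        if label == "shot" and conf < 0.5:
            follow = [l for (t2, l, _) in actions[i + 1:] if t2 - t <= window]
            if not any(l in {"save", "goal", "goal_kick"} for l in follow):
                continue  # no downstream evidence: treat as a false positive
        kept.append((t, label, conf))
    return kept
```

The learned model generalizes this hand-written rule: instead of one pattern, the Transformer absorbs the tactical regularities of the whole action sequence plus game state, which is how it can raise precision without sacrificing recall in the low-confidence regime.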
Problem

Research questions and friction points this paper is trying to address.

Improving spatio-temporal action detection in soccer videos
Reducing false positives using contextual game information
Enhancing precision and recall with tactical sequence modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based encoder-decoder model for denoising
Joint reasoning over team-level dynamics
Leveraging tactical regularities in soccer
Jeremie Ochin
Centre for Robotics, Mines Paris - PSL, France; Footovision, Paris, France
Raphael Chekroun
Quant Researcher @ Qube RT
Autonomous Driving, Reinforcement Learning, Imitation Learning, Computer Vision, Deep Learning
Bogdan Stanciulescu
Mines ParisTech
computer vision, machine learning, robotics
Sotiris Manitsaris
Centre for Robotics, Mines Paris - PSL, France