🤖 AI Summary
Existing soccer video understanding methods struggle to generate reliable, fine-grained play-by-play annotations automatically: action recognition lacks tactical grounding, and the outputs of computer-vision tasks (e.g., tracking, player identification) are not connected to long-term tactical patterns. This paper introduces FOOTPASS, the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It is designed for player-centric methods that combine multi-object tracking, player identification, and spatiotemporal action detection with tactical priors over long time horizons. Contributions include: (1) the first full-match, tactically aware benchmark; (2) a setting for developing player-centric, multi-agent action spotting methods; and (3) a path toward more automated and reliable annotation, yielding structured play-by-play streams for data-driven soccer analytics.
📝 Abstract
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), and multi-object tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce the Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.