🤖 AI Summary
Current end-to-end models for sports video analysis suffer from weak temporal hierarchical modeling, poor generalization, high task-specific adaptation costs, and limited interpretability. To address these limitations, we propose a reconfigurable multi-agent system grounded in a cognitive toolification framework, which orchestrates specialized agents (a temporal reasoning agent, an event detection module, and a generative summarization model) in a collaborative, role-based manner. The system enables dynamic workflow orchestration across temporal scales, from micro-level actions to macro-level strategies, and across semantic levels. Its modular architecture supports iterative invocation and flexible reconfiguration, which improves generalization, interpretability, and cross-task extensibility. Evaluated on a badminton video dataset, the framework handles both fine-grained shot-level question answering and holistic match summarization within a single system.
📝 Abstract
Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct "cognitive tool" specializing in a specific aspect of analysis. The system's architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video intelligence. The project homepage is available at https://aiden1020.github.io/COACH-project-page
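The idea of composing agents into task-specific pipelines can be illustrated with a minimal Python sketch. It is not the authors' implementation: the agent functions (`detect_events`, `answer_shot_count`, `summarize_match`), the `Pipeline` class, and the shot annotation format are all hypothetical stand-ins, and each "agent" is reduced to a plain function over a shared context dictionary. The sketch only shows how the same event-detection agent might be reused in a short-term Rally QA pipeline and, invoked once per rally, in a long-term match summary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical shared state passed between agents.
Context = Dict[str, Any]
Agent = Callable[[Context], Context]


def detect_events(ctx: Context) -> Context:
    """Toy event detector: turns annotated shots into 'player:type' labels."""
    events = [f"{s['player']}:{s['type']}" for s in ctx["shots"]]
    return {**ctx, "events": events}


def answer_shot_count(ctx: Context) -> Context:
    """Toy rally-level QA: counts detected events of the queried shot type."""
    wanted = ctx["query_shot_type"]
    count = sum(1 for e in ctx["events"] if e.endswith(f":{wanted}"))
    return {**ctx, "answer": count}


def summarize_match(ctx: Context) -> Context:
    """Toy summarizer: aggregates per-rally events into one sentence."""
    rallies = ctx["rally_events"]
    total = sum(len(r) for r in rallies)
    return {**ctx, "summary": f"{len(rallies)} rallies, {total} shots"}


@dataclass
class Pipeline:
    """An ordered composition of agents; reconfiguring means changing the list."""
    agents: List[Agent]

    def run(self, ctx: Context) -> Context:
        for agent in self.agents:
            ctx = agent(ctx)
        return ctx


# Short-term pipeline: detect events in one rally, then answer a question.
rally_qa = Pipeline([detect_events, answer_shot_count])


def match_summary(rallies: List[List[Dict[str, str]]]) -> str:
    """Long-term pipeline: invoke the detector once per rally, then summarize."""
    per_rally = Pipeline([detect_events])
    rally_events = [per_rally.run({"shots": shots})["events"] for shots in rallies]
    return Pipeline([summarize_match]).run({"rally_events": rally_events})["summary"]
```

In this sketch, switching tasks means assembling a different agent list rather than retraining a monolithic model, and every intermediate result stays inspectable in the context dictionary. The real system presumably uses learned models behind each agent; the functions here are placeholders.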