TennisExpert: Towards Expert-Level Analytical Sports Video Understanding

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of current tennis video understanding research: the absence of large-scale benchmarks with fine-grained annotations and expert-level commentary, and the difficulty of building efficient real-time multimodal systems. To this end, we introduce TennisVL, the first expert-level tennis video benchmark focused on tactical analysis, and present TennisExpert, a multimodal understanding framework built upon Qwen3-VL-8B. TennisExpert incorporates a video semantic parser that extracts key elements such as scores, shot sequences, ball landing positions, and player locations, and it employs a hierarchical short- and long-term memory mechanism to model temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, on TennisVL, with better tactical reasoning and a better grasp of match dynamics.

📝 Abstract
Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics.
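The abstract does not describe how the parser output or the memory modules are implemented. As a rough illustration only, the sketch below shows one way rally-level parser output could feed a two-level memory: a bounded window of recent rallies (short-term) plus running aggregates over the match (long-term). Every name here (`RallyRecord`, `HierarchicalMemory`, `window`, `context`) is hypothetical and not taken from the paper, and the aggregation by shot counts is an assumption chosen for simplicity.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class RallyRecord:
    """Parsed elements for one rally clip (hypothetical field names)."""
    score: str                                  # score when the rally starts, e.g. "30-15"
    shots: list[str]                            # shot sequence, e.g. ["serve", "forehand"]
    bounces: list[tuple[float, float]]          # ball landing positions in court coordinates
    player_positions: list[tuple[float, float]]  # player locations in court coordinates


@dataclass
class HierarchicalMemory:
    """Assumed design: recent rallies in a bounded window, plus match-level aggregates."""
    window: int = 3
    short_term: deque = field(init=False)
    rallies_seen: int = 0
    shot_counts: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # deque(maxlen=...) silently drops the oldest rally once the window is full
        self.short_term = deque(maxlen=self.window)

    def update(self, rally: RallyRecord) -> None:
        """Add one rally to the short-term window and fold it into long-term totals."""
        self.short_term.append(rally)
        self.rallies_seen += 1
        for shot in rally.shots:
            self.shot_counts[shot] = self.shot_counts.get(shot, 0) + 1

    def context(self) -> str:
        """Render both memory levels as text that could be placed in a model prompt."""
        recent = "; ".join(
            f"{r.score}: {' > '.join(r.shots)}" for r in self.short_term
        )
        totals = ", ".join(
            f"{shot} {n}" for shot, n in sorted(self.shot_counts.items())
        )
        return (
            f"Recent rallies: {recent}\n"
            f"Rallies seen: {self.rallies_seen}; shot counts: {totals}"
        )


memory = HierarchicalMemory(window=2)
for score in ("15-0", "30-0", "40-0"):
    memory.update(
        RallyRecord(
            score=score,
            shots=["serve", "forehand"],
            bounces=[(1.0, 2.0)],
            player_positions=[(0.0, -11.0), (0.0, 11.0)],
        )
    )
print(memory.context())
```

In this sketch the short-term window keeps prompt length bounded while the long-term counters retain information from rallies that have already left the window. The actual TennisExpert memory modules may work quite differently.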
Problem

Research questions and friction points this paper is trying to address.

sports video understanding
expert-level commentary
multimodal systems
real-time deployment
tennis analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

expert-level sports commentary
multimodal video understanding
memory-augmented model
tactical reasoning
large-scale tennis benchmark