TennisExpert: Towards Expert-Level Analytical Sports Video Understanding

📅 2026-03-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations in current tennis video understanding research: the absence of large-scale benchmarks with fine-grained annotations and expert-level commentary, and the difficulty of building multimodal systems that are both accurate and efficient enough for real-time use. The paper introduces TennisVL, described as the first expert-level tennis video benchmark focused on tactical analysis, and presents TennisExpert, a multimodal understanding framework built on Qwen3-VL-8B. TennisExpert uses a video semantic parser to extract key match elements (scores, shot sequences, ball bounces, and player locations) and a hierarchical short- and long-term memory mechanism to model temporal context. Experiments on TennisVL show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, with better tactical reasoning and a better grasp of match dynamics.

📝 Abstract

Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics.
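
The abstract describes a parser that emits structured match elements and hierarchical memory modules that keep short- and long-term context. The sketch below is only one illustrative way such a pipeline could be organized and is not the authors' implementation: the names `RallyRecord` and `HierarchicalMemory`, the field layout, the window size, and the text format of the context are all assumptions made for this example. It keeps the most recent rallies verbatim (short-term) and running match statistics (long-term), then renders both as text that could be passed to a vision-language model alongside a clip.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class RallyRecord:
    """Hypothetical parser output for one rally-level clip (assumed schema)."""
    score: str                                   # score before the rally, e.g. "30-15"
    shots: list[str]                             # ordered shot types in the rally
    bounces: list[tuple[float, float]] = field(default_factory=list)  # court coords
    winner: str = ""                             # player who won the rally, if known


class HierarchicalMemory:
    """Assumed two-level memory: recent rallies plus running match statistics."""

    def __init__(self, short_window: int = 3) -> None:
        # Short-term: the last `short_window` rallies, oldest dropped automatically.
        self.short_term: deque[RallyRecord] = deque(maxlen=short_window)
        # Long-term: aggregates that summarize the whole match so far.
        self.long_term = {"rallies": 0, "shots": 0, "wins": {}}

    def update(self, rally: RallyRecord) -> None:
        """Add one parsed rally to both memory levels."""
        self.short_term.append(rally)
        stats = self.long_term
        stats["rallies"] += 1
        stats["shots"] += len(rally.shots)
        if rally.winner:
            stats["wins"][rally.winner] = stats["wins"].get(rally.winner, 0) + 1

    def context(self) -> str:
        """Render long-term summary first, then the recent rallies in order."""
        stats = self.long_term
        avg = stats["shots"] / stats["rallies"] if stats["rallies"] else 0.0
        lines = [
            f"Match so far: {stats['rallies']} rallies, "
            f"avg rally length {avg:.1f} shots, rallies won {stats['wins']}"
        ]
        for r in self.short_term:
            lines.append(
                f"Recent rally at {r.score}: {' > '.join(r.shots)}; "
                f"won by {r.winner or 'unknown'}"
            )
        return "\n".join(lines)
```

In use, each parsed clip would be passed to `update`, and the string returned by `context` would be prepended to the model prompt for the next clip. A bounded `deque` keeps the prompt size fixed regardless of match length, which is one plausible way to stay within a real-time budget; the paper may handle this differently.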

Problem

Research questions and friction points this paper is trying to address.

sports video understanding

expert-level commentary

multimodal systems

real-time deployment

tennis analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

expert-level sports commentary

multimodal video understanding

memory-augmented model