Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

📅 2026-03-10
🤖 AI Summary
This work addresses the limited spatial reasoning of current vision-language models (VLMs) in complex, dynamic sports scenarios, a problem compounded by the absence of dedicated evaluation benchmarks and large-scale datasets. To bridge this gap, the study treats sports scenes as a systematic testbed for spatial intelligence and proposes a scalable data construction method grounded in court geometry. Using semi-automatic scene reconstruction, the authors generate over one million question-answer pairs to build CourtSI, the first large-scale sports-centric spatial intelligence dataset, together with CourtSI-Bench, a human-verified evaluation benchmark of 3,686 QA pairs. Evaluation of 25 proprietary and open-source VLMs on CourtSI-Bench reveals a remaining performance gap between models and humans. Fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points; the adapted model also generalizes to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and produces more spatially aware commentary.

📝 Abstract
Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
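The abstract does not describe how the data engine uses court geometry, so the following is only a minimal sketch of the general idea of a metric anchor, assuming a single camera view and players standing on the court plane. It fits a planar homography from four court-corner pixels to their known positions in metres (a badminton doubles court is 13.40 m by 6.10 m) and uses it to measure a ground distance. The function names and the direct linear transform are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

# Standard badminton doubles court, in metres (length, width).
BADMINTON_COURT_M = (13.40, 6.10)


def court_corners(length_m, width_m):
    """Court corners on the ground plane in metres, ordered around the rectangle."""
    return np.array(
        [[0.0, 0.0], [width_m, 0.0], [width_m, length_m], [0.0, length_m]]
    )


def fit_homography(src_pts, dst_pts):
    """Direct linear transform: 3x3 matrix mapping src points onto dst points.

    Hypothetical helper; needs at least four non-collinear correspondences.
    """
    rows = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def apply_homography(h, pts):
    """Map 2D points through h, including the perspective divide."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]


def ground_distance_m(h, pixel_a, pixel_b):
    """Metric distance between two pixels that both lie on the court plane."""
    a, b = apply_homography(h, [pixel_a, pixel_b])
    return float(np.linalg.norm(a - b))
```

In use, the four corner pixels would come from annotation or a court-line detector, and `ground_distance_m` would be called on the pixels where two players' feet touch the floor. Points off the ground plane, such as a shuttlecock in flight, are not handled by a single homography and would need full 3D reconstruction.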
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence
vision-language models
sports scenarios
benchmarking
human motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial intelligence
vision-language models
sports benchmarking
semi-automatic data engine
CourtSI
Authors

Yuchen Yang
Fudan University, Shanghai Artificial Intelligence Laboratory

Yuqing Shao
East China University of Science and Technology, Shanghai Artificial Intelligence Laboratory

Duxiu Huang
Southeast University

Linfeng Dong
Zhejiang University, Shanghai Artificial Intelligence Laboratory

Yifei Liu
PhD student, Shanghai AI Lab
Sparse Computation, Asynchronous Computation, Token/Gaussian Pruning

Suixin Tang
East China University of Science and Technology

Xiang Zhou
Guangdong Institute of Intelligence Science and Technology
Single-cell multi-omics, Spatial multi-omics

Yuanyuan Gao
Hong Kong University of Science and Technology, Shanghai Artificial Intelligence Laboratory

Wei Wang
Shanghai Artificial Intelligence Laboratory

Yue Zhou
Associate Professor, East China Normal University
Remote Sensing Vision-Language Model, Oriented Object Detection

Xue Yang
Shanghai Jiao Tong University

Yanfeng Wang
Shanghai Jiao Tong University

Xiao Sun
Scientist, Shanghai AI Laboratory
Computer Vision, Machine Learning

Zhihang Zhong
Researcher, Shanghai AI Laboratory
Computer vision, Deep learning