SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

📅 2025-11-09

🤖 AI Summary
Current multimodal models exhibit limited fine-grained visual perception and rule-based reasoning capabilities in sports understanding, and lack a comprehensive benchmark covering multiple sports, enabling multi-level reasoning, and supporting precise visual grounding. To address this, we introduce SportR—the first large-scale, multi-sport benchmark spanning both image and video modalities. SportR innovatively incorporates a progressive question-answering hierarchy, human-annotated chain-of-thought rationales, and bounding-box grounding for visual localization, enabling holistic evaluation from factual recognition to rule-based inference. Leveraging SportR, we jointly train multimodal large language models via supervised fine-tuning and reinforcement learning. Extensive experiments reveal significant performance bottlenecks in state-of-the-art models across all SportR tasks, underscoring the benchmark’s utility in diagnosing technical limitations and advancing sports intelligence research.

📝 Abstract
Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning, a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sports rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover a single sport or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first large-scale multi-sport benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure the benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths, from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, the benchmark incorporates both image and video modalities and provides manual bounding-box annotations to directly test visual grounding on the image portion. Extensive experiments demonstrate the profound difficulty of our benchmark: state-of-the-art baseline models perform poorly on the most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking multimodal reasoning in sports using visual perception and rules
Evaluating nuanced visual details and abstract sport knowledge application
Addressing multi-step reasoning gaps in current multimodal sports models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-sports benchmark with multimodal data
Progressive QA hierarchy for granular evaluation
Human-authored Chain of Thought annotations
Authors

Haotian Xia
Rice University
Natural Language Processing; Sports Analytics

Haonan Ge
Southeast University
Vision Language Model

Junbo Zou
Georgia Institute of Technology

Hyun Woo Choi
University of California, Irvine

Xuebin Zhang
University of California, Irvine

Danny Suradja
University of California, Irvine

Botao Rui
University of California, Irvine

Ethan Tran
University of California, Irvine

Wendy Jin
Rice University

Zhen Ye
Johns Hopkins University

Xiyang Lin
University of California, Irvine

Christopher Lai
University of California, Santa Barbara

Shengjie Zhang
Beihang University
Multimodal content analysis; Intelligent Airport Management Systems

Junwen Miao
University of California, Irvine

Shichao Chen
Rice University

Rhys Tracy
University of California, Santa Barbara

Vicente Ordonez
Rice University

Weining Shen
Associate Professor of Statistics, University of California, Irvine
Statistics; Machine learning; Biostatistics

Hanjie Chen
Rice University