🤖 AI Summary
This work addresses the lack of benchmarks for evaluating multimodal large language models' (MLLMs) rule comprehension and hierarchical reasoning capabilities in sports contexts. To this end, we introduce SPORTU, a benchmark dedicated to sports understanding and reasoning, comprising two components: text-only rule comprehension (SPORTU-text) and slow-motion video-based multi-level reasoning (SPORTU-video). The benchmark integrates rule comprehension, strategic reasoning, and foul recognition into a unified evaluation framework, pairing 900 human-annotated multiple-choice questions (SPORTU-text) with 12,048 fine-grained video question-answering pairs drawn from 1,701 slow-motion clips spanning seven sports (SPORTU-video). SPORTU-text evaluation employs few-shot learning with chain-of-thought prompting. GPT-4o achieves the highest SPORTU-text accuracy at 71%, still below human-level performance, while the best model on the hard SPORTU-video task, Claude-3.5-Sonnet, reaches only 52.6% accuracy, revealing critical limitations in MLLMs' deep sports reasoning.
📝 Abstract
Multimodal Large Language Models (MLLMs) are advancing the ability to reason about complex sports scenarios by integrating textual and visual information. To comprehensively evaluate their capabilities, we introduce SPORTU, a benchmark designed to assess MLLMs across multi-level sports reasoning tasks. SPORTU comprises two key components. SPORTU-text features 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding; it tests models' ability to reason about sports through question answering (QA) alone, without visual inputs. SPORTU-video consists of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning, from simple sports recognition to complex tasks such as foul detection and rule application. On SPORTU-text, we evaluate four prevalent LLMs using few-shot learning supplemented by chain-of-thought (CoT) prompting. GPT-4o achieves the highest accuracy of 71%, but still falls short of human-level performance, highlighting room for improvement in rule comprehension and reasoning. On SPORTU-video, we evaluate 7 proprietary and 6 open-source MLLMs. Experiments show that models fall short on hard tasks that require deep reasoning and rule-based understanding. Claude-3.5-Sonnet performs best with only 52.6% accuracy on the hard task, leaving large room for improvement. We hope that SPORTU will serve as a critical step toward evaluating models' capabilities in sports understanding and reasoning.
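To make the SPORTU-text evaluation protocol concrete, below is a minimal Python sketch of a few-shot chain-of-thought (CoT) prompt for a multiple-choice rule question, together with simple answer extraction and accuracy scoring. This is an illustration under stated assumptions, not the paper's actual harness: the item schema (`question`, `options`, `answer`), the exemplar question, and the prompt wording are all hypothetical, and `query_model` stands in for whatever LLM API is used.

```python
# Minimal sketch of few-shot CoT evaluation on an MCQ benchmark like SPORTU-text.
# The data fields, exemplar, and prompt template are illustrative assumptions,
# not the paper's actual schema or prompts.

# One worked exemplar gives the model the expected reasoning-then-answer format.
FEW_SHOT_EXEMPLAR = (
    "Question: In basketball, how many free throws are awarded for a "
    "shooting foul on a missed three-point attempt?\n"
    "Options: A) 1  B) 2  C) 3  D) 0\n"
    "Reasoning: A shooting foul on a missed shot awards free throws equal to "
    "the value of the attempted shot, so a three-point attempt yields three.\n"
    "Answer: C\n"
)

def build_cot_prompt(item: dict) -> str:
    """Assemble a few-shot CoT prompt for one multiple-choice item."""
    options = "  ".join(f"{k}) {v}" for k, v in item["options"].items())
    return (
        "Answer the sports rule question. Think step by step, then give the "
        "final option letter on a line starting with 'Answer:'.\n\n"
        f"{FEW_SHOT_EXEMPLAR}\n"
        f"Question: {item['question']}\n"
        f"Options: {options}\n"
        "Reasoning:"
    )

def extract_answer(completion: str) -> str | None:
    """Pull the final option letter from the model's completion."""
    for line in reversed(completion.strip().splitlines()):
        if line.strip().lower().startswith("answer:"):
            letter = line.split(":", 1)[1].strip()[:1].upper()
            return letter if letter in "ABCD" else None
    return None

def evaluate(items: list[dict], query_model) -> float:
    """Accuracy over items, given a callable that maps prompt -> completion."""
    correct = sum(
        extract_answer(query_model(build_cot_prompt(it))) == it["answer"]
        for it in items
    )
    return correct / len(items)
```

The same pattern extends to the video track by prepending sampled frames or a video token sequence to the prompt, with the multi-level difficulty controlled by the question type (recognition vs. foul detection and rule application).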