SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing soccer video understanding models: they are evaluated almost entirely on classification accuracy, without verifying whether predictions are grounded in genuine visual evidence. To bridge this gap, the authors introduce SoccerLens, a benchmark covering 13 soccer event categories, with visual cues annotated at three levels of semantic relevance. They propose the first fine-grained visual grounding evaluation framework tailored to soccer videos, combining an extended spatiotemporal attention attribution method with spatiotemporal alignment metrics that quantify how consistently model attention matches human-annotated cues. Experiments show that, even under the most lenient cue definition, state-of-the-art vision-language models achieve grounding performance below 50% and consistently underuse temporal information, exposing a significant gap between high classification accuracy and genuine visual understanding.
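To make the idea of an alignment metric concrete, here is a minimal sketch of one plausible score: the share of attribution mass that lands inside annotated cue regions, plus a per-frame breakdown of that mass. This is an illustration, not the metric defined by SoccerLens. The function names `attention_mass_in_cues` and `temporal_mass`, and the representation of a clip as nested `[frame][row][column]` lists, are assumptions made for the example.

```python
def attention_mass_in_cues(attribution, cue_mask):
    """Fraction of total attribution mass that falls inside annotated cues.

    attribution: nested lists [T][H][W] of non-negative relevance values.
    cue_mask:    nested lists [T][H][W] of booleans, True where a
                 human-annotated cue is present (hypothetical format).
    """
    total = 0.0
    inside = 0.0
    for frame_attr, frame_mask in zip(attribution, cue_mask):
        for row_attr, row_mask in zip(frame_attr, frame_mask):
            for value, is_cue in zip(row_attr, row_mask):
                total += value
                if is_cue:
                    inside += value
    # An all-zero attribution map carries no evidence, so score it as 0.
    return inside / total if total > 0 else 0.0


def temporal_mass(attribution):
    """Share of attribution mass assigned to each frame."""
    per_frame = [sum(sum(row) for row in frame) for frame in attribution]
    total = sum(per_frame)
    return [mass / total if total > 0 else 0.0 for mass in per_frame]


if __name__ == "__main__":
    # Toy clip: 2 frames of 2x2 patches.
    attribution = [
        [[1.0, 1.0], [0.0, 0.0]],
        [[2.0, 0.0], [0.0, 0.0]],
    ]
    cue_mask = [
        [[True, False], [False, False]],
        [[True, False], [False, False]],
    ]
    print(attention_mass_in_cues(attribution, cue_mask))
    print(temporal_mass(attribution))
```

Under this sketch, a lenient score could be obtained by passing the union of the cue masks from all three relevance levels, and a strict score by passing only the most relevant level. Whether the benchmark combines its levels this way is an assumption.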

📝 Abstract

Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning $13$ common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer et al. [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed $50\%$ grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatiotemporal domains such as soccer.
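For context, the self-attention relevancy update in the method of Chefer et al. can be written as follows, where $A^{(l)}$ is the attention map of layer $l$, $\nabla A^{(l)}$ is the gradient of the target output with respect to it, $\odot$ is the element-wise product, $(\cdot)^{+}$ clamps negative values to zero, and $\mathbb{E}_h$ averages over attention heads:

$$
R^{(0)} = I, \qquad \bar{A}^{(l)} = \mathbb{E}_h\!\left[\left(\nabla A^{(l)} \odot A^{(l)}\right)^{+}\right], \qquad R^{(l)} = R^{(l-1)} + \bar{A}^{(l)} R^{(l-1)}
$$

The abstract does not spell out how this is extended to video. One plausible reading, stated here purely as an assumption, is that visual tokens are indexed by a frame $t \in \{1, \dots, T\}$ and a patch $i \in \{1, \dots, N\}$, so that the relevance assigned to the visual tokens after the last layer can be reshaped into a $T \times N$ map and compared against cue annotations both per patch and per frame.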

Problem

Research questions and friction points this paper is trying to address.

visual grounding

soccer video understanding

vision-language models

spurious correlations

evaluation benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual grounding

soccer video understanding

vision-language models