Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatically generated soccer commentary often falls short of live-broadcast requirements because of anonymous entity references, context-dependent errors, and a lack of match statistics. This work proposes GameSight, a two-stage model that treats commentary generation as a knowledge-enhanced visual reasoning task: it first aligns anonymous players and teams with their identities through fine-grained visual and contextual analysis, then refines the aligned commentary with external historical statistics and an iteratively updated game state. On the SN-Caption-test-align dataset, GameSight improves player alignment accuracy by 18.5% over Gemini 2.5 Pro. With knowledge enhancement, it also reports better segment-level accuracy and commentary quality, along with better game-level contextual relevance and structural composition.
📝 Abstract
Soccer commentary plays a crucial role in enhancing the viewing experience for audiences. Previous studies in automatic soccer commentary generation typically adopt an end-to-end method to generate anonymous live text commentary. Such commentary falls short of real-world live televised commentary, as it contains anonymous entities and context-dependent errors and lacks statistical insights into game events. To bridge the gap, we propose GameSight, a two-stage model that addresses soccer commentary generation as a knowledge-enhanced visual reasoning task, enabling knowledgeable, broadcast-style commentary with accurate references to entities (players and teams). GameSight starts by performing visual reasoning to align anonymous entities through fine-grained visual and contextual analysis. Subsequently, the entity-aligned commentary is refined with knowledge by incorporating external historical statistics and iteratively updated internal game state information. Consequently, GameSight improves player alignment accuracy by 18.5% on the SN-Caption-test-align dataset compared to Gemini 2.5 Pro. Combined with further knowledge enhancement, GameSight achieves better segment-level accuracy and commentary quality, as well as better game-level contextual relevance and structural composition. We believe that our work paves the way for a more informative and engaging human-centric experience in AI sports applications. Demo Page: https://gamesight2025.github.io/gamesight2025
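The two-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the visual reasoning of stage 1 is replaced by a pre-resolved lookup table, and every name here (`GameState`, `align_entities`, `enhance`, the `[PLAYER]`/`[TEAM]` placeholder tokens, and the sample statistics) is a hypothetical stand-in chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GameState:
    """Hypothetical internal game state: per-player goal counts for one match."""
    goals: dict[str, int] = field(default_factory=dict)

    def record(self, player: str, event: str) -> None:
        # Only goals are tracked in this toy example.
        if event == "goal":
            self.goals[player] = self.goals.get(player, 0) + 1


def align_entities(anonymous: str, resolved: dict[str, str]) -> str:
    """Stage 1 (stubbed): replace anonymous placeholders with resolved names.

    In the paper this resolution comes from visual and contextual reasoning;
    here it is assumed to be given as a dictionary.
    """
    for placeholder, name in resolved.items():
        anonymous = anonymous.replace(placeholder, name)
    return anonymous


def enhance(aligned: str, player: str, event: str,
            history: dict[str, str], state: GameState) -> str:
    """Stage 2 (toy): update the game state, then append in-match and
    historical facts to the entity-aligned commentary line."""
    state.record(player, event)
    parts = [aligned]
    if event == "goal":
        parts.append(f"That is goal {state.goals[player]} of the match for {player}.")
    if player in history:
        parts.append(history[player])
    return " ".join(parts)


# Made-up inputs for illustration only.
demo_state = GameState()
demo_history = {"Player A": "Player A scored 10 league goals last season."}
demo_line = align_entities(
    "[PLAYER] scores for [TEAM]!",
    {"[PLAYER]": "Player A", "[TEAM]": "Team X"},
)
print(enhance(demo_line, "Player A", "goal", demo_history, demo_state))
```

Keeping the game state as a separate object that is updated once per segment mirrors the abstract's "iteratively updated internal game state", so later segments can refer to what has already happened in the match.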
Problem

Research questions and friction points this paper is trying to address.

soccer commentary generation
entity alignment
knowledge-enhanced reasoning
visual reasoning
contextual relevance
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge-enhanced visual reasoning
entity alignment
automatic sports commentary
two-stage generation model
contextual game state
Zeyu Jin
Adobe Research
Speech and audio processing, Deep Learning
Xiaoyu Qin
Tsinghua University
Artificial Intelligence
Songtao Zhou
Tsinghua University
Multimedia, Speech Synthesis, Multimodal Generation
Kaifeng Yun
Department of Computer Science and Technology, Tsinghua University
Jia Jia
Department of Computer Science and Technology, Tsinghua University; BNRist, Tsinghua University