🤖 AI Summary
To address the challenges of weak domain understanding and shallow reasoning in large language models (LLMs) for sports video question answering (VideoQA), this paper proposes a training-free dual-mode reasoning framework that synergistically integrates reactive and deliberative reasoning. We construct SSGraph, the first multimodal sports knowledge graph covering nine sports, to enhance domain-specific semantics. Inspired by cognitive science, we introduce a novel “thinking agent” architecture that jointly performs visual instance recognition and domain terminology alignment for knowledge grounding. Additionally, we propose a zero-shot multimodal scene graph modeling method to capture spatiotemporal relations in sports videos. We further release two new benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets. Our approach achieves state-of-the-art performance on Gym-QA, Diving-QA, and SPORTU, while preserving strong generalization across standard VideoQA tasks.
📝 Abstract
Video Question Answering (VideoQA) based on Large Language Models (LLMs) has shown potential in general video understanding but faces significant challenges when applied to the inherently complex domain of sports videos. In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports, which encodes both visual instances and domain-specific terminology to enhance reasoning accuracy. Furthermore, we introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets, enabling diverse and comprehensive evaluation. FineQuest achieves state-of-the-art performance on these benchmarks as well as the existing SPORTU dataset, while maintaining strong general VideoQA capabilities.