Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching

📅 2026-03-18

🤖 AI Summary

This work addresses the tendency of video large language models (Video-LLMs) to attend to irrelevant frames, which impairs temporal grounding in sports coaching tasks. Frame-level supervision is hard to obtain: it is expensive to collect from humans and unreliable when produced by other models. The authors propose a self-consistency objective built on the observation that closely related tasks, such as generation and verification, should attend to the same critical frames. Enforcing consistency between selected visual attention maps of such tasks provides a training signal that requires no additional annotations. Using the ground-truth keyframe annotations in VidDiffBench, the authors first confirm that attention misallocation is a significant bottleneck. Training with the objective then yields gains of +3.0% and +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks (Exact, FitnessQA, and ExpertAF), even surpassing closed-source models.

📝 Abstract

Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.
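To make the idea of a self-consistency objective over attention maps concrete, here is a minimal sketch of one way such a penalty could be computed. The abstract does not state the form of the loss, so the Jensen-Shannon divergence used here is an assumption, as are the function names `normalize`, `kl_divergence`, and `temporal_consistency_loss` and the representation of each attention map as one non-negative score per frame.

```python
import math


def normalize(scores):
    """Turn non-negative per-frame attention mass into a distribution over frames."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("attention scores must have positive total mass")
    return [s / total for s in scores]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def temporal_consistency_loss(attn_generation, attn_verification):
    """Jensen-Shannon divergence between two per-frame attention profiles.

    Assumption: each argument holds the attention mass that one task
    (e.g. generation or verification) places on each video frame.
    The loss is 0 when both tasks attend to frames in the same proportions
    and approaches ln(2) when they attend to disjoint frames.
    """
    if len(attn_generation) != len(attn_verification):
        raise ValueError("both attention profiles must cover the same frames")
    p = normalize(attn_generation)
    q = normalize(attn_verification)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical attention over 4 frames for two related tasks on the same video.
agree = temporal_consistency_loss([0.1, 0.6, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1])
disagree = temporal_consistency_loss([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
```

In a training setup, a term like this would presumably be added to the usual finetuning loss with a weighting coefficient, so that the model is pushed toward attending to the same frames for both tasks without any frame-level labels. That combination is likewise an assumption rather than a detail given in the abstract.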

Problem

Research questions and friction points this paper is trying to address.

temporal grounding

sports coaching

video-LLMs

frame-level supervision

attention misallocation

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal grounding

self-consistency

video-LLMs