🤖 AI Summary
This work addresses the scarcity of fine-grained sports-skill coverage in existing large-scale video datasets, which limits their usefulness for physical skill learning. To bridge this gap, the authors introduce SportSkills, the first large-scale instructional sports video dataset geared towards skill learning, spanning 55 sports with more than 360k instructional videos and over 630k visual demonstrations paired with instructional narrations. They further propose a novel task, mistake-conditioned instructional video retrieval, which recommends coaching clips based on a user's own action performance. By aligning visual demonstrations with their instructional narrations, the learned representation achieves gains of up to 4x over the same model trained on traditional activity-centric datasets. Formal evaluations by professional coaches confirm that the retrieval approach substantially improves personalized visual instruction, establishing the first large-scale benchmark for sports skill learning.
📝 Abstract
Current large-scale video datasets focus on general human activity but lack the depth of coverage of fine-grained activities needed for physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills contains more than 360k instructional videos with more than 630k visual demonstrations from 55 varied sports, each paired with instructional narrations explaining the know-how behind the actions. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions: our representation achieves gains of up to 4x over the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instruction for a user's query.
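The abstract describes pairing visual demonstrations with instructional narrations to learn a joint representation, then using that representation to retrieve the most relevant clip for a user's query. A common way to realize such pairing is a CLIP-style symmetric contrastive (InfoNCE) objective over a batch of video–narration embeddings; the sketch below illustrates that idea in NumPy. This is a minimal illustration of the general technique, not the paper's actual model or training code, and all function names here are hypothetical.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    video_emb, text_emb: (B, D) arrays where row i of each is a matched pair.
    Matched pairs are pulled together; all other in-batch pairs are pushed apart.
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (B, B); matched pairs on the diagonal

    def diag_cross_entropy(l):
        # Numerically stable log-softmax per row, then pick the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(l.shape[0])
        return -log_prob[idx, idx].mean()

    # Average the video-to-text and text-to-video directions
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def retrieve(query_emb, clip_embs, k=1):
    """Rank instructional clips by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]  # indices of the top-k clips
```

At inference, a user's skill execution would be encoded by the video tower and `retrieve` would return the instructional clips whose embeddings lie closest in the shared space.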