🤖 AI Summary
This study addresses the limited generalization of current video large language models in sports feedback generation: these models typically rely on scarce, expensive domain-specific fine-tuning data, which hinders their transfer to new athletic disciplines. Taking rock climbing as a case study, the work combines freely available multimodal resources from the target domain, such as public competition videos and coaching manuals, with feedback data from a source domain, using cross-domain transfer learning to improve feedback quality under low-label conditions. The paper also introduces two new evaluation metrics, “specificity” and “actionability,” which go beyond conventional text-overlap measures to better assess the practical utility and instructional value of generated sports feedback.
📝 Abstract
While video-LLMs with advanced reasoning capabilities are progressing rapidly, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive, difficult-to-collect finetuning feedback data for each sport. This limitation is evident in their poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose leveraging auxiliary, freely available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint source domain, to improve feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, these contributions enable more meaningful and practical generation of sports feedback under limited annotations.
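To illustrate why surface-overlap metrics are a poor fit for feedback quality, here is a minimal pure-Python sketch of an LCS-based ROUGE-L F-score (not the paper's code; the climbing sentences are invented examples). Feedback that contradicts the reference but shares most of its words can score far higher than feedback that gives sound, actionable advice in different words.

```python
def lcs_len(a, b):
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, candidate):
    # Token-level ROUGE-L: F1 over the LCS of reference and candidate tokens.
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Two candidate pieces of climbing feedback against the same reference:
ref = "keep your hips close to the wall on the steep section"
contradicting = "keep your hips away from the wall on the steep section"  # opposite advice
paraphrase = "drive from your legs and stay tight to the rock when it steepens"  # sound advice, different words

print(rouge_l_f1(ref, contradicting))  # high: heavy word overlap despite contradicting the reference
print(rouge_l_f1(ref, paraphrase))     # low: useful coaching advice, but little lexical overlap
```

This gap between lexical overlap and instructional value is exactly what the proposed specificity and actionability metrics aim to address.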