AI Summary
Existing video-based skill assessment methods produce only scalar scores or coarse comparative rankings and offer no actionable improvement suggestions. To address this, we propose the first end-to-end framework for generating multimodal, actionable expert feedback from motion videos and 3D poses, jointly producing natural-language critiques (highlighting strengths and weaknesses) and corrected visual demonstrations (via pose regeneration and video retrieval). Our method adopts a weakly supervised paradigm: we construct a training set on top of Ego-Exo4D and train a unified video-language model that jointly optimizes critique generation, pose correction, and demonstration-video retrieval. Quantitative evaluation and human preference studies demonstrate significant improvements over strong baselines. The code and dataset will be publicly released.
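To make the weak-supervision idea concrete, the sketch below shows one plausible way raw expert commentary could be converted into structured critique targets with a language model. The prompt, JSON schema, and `query_llm` helper are illustrative assumptions only, not the paper's actual data-construction pipeline.

```python
# Illustrative sketch of weak-label construction: pairing expert commentary
# with an LLM to produce structured (strengths / corrections) targets.
# PROMPT_TEMPLATE, the JSON schema, and `query_llm` are hypothetical.
import json
from typing import Callable, Dict, List

PROMPT_TEMPLATE = (
    "You are a coach. Given the expert commentary below about a trainee's "
    "performance, rewrite it as JSON with two lists: 'good' (what the trainee "
    "does well) and 'improve' (specific, actionable corrections).\n\n"
    "Commentary:\n{commentary}\n"
)


def build_weak_labels(
    samples: List[Dict],                 # each: {"clip_id": ..., "commentary": "..."}
    query_llm: Callable[[str], str],     # any text-in / text-out LLM call
) -> List[Dict]:
    """Turn raw expert commentary into structured critique targets."""
    labeled = []
    for s in samples:
        raw = query_llm(PROMPT_TEMPLATE.format(commentary=s["commentary"]))
        try:
            critique = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # skip samples the LLM failed to structure
        labeled.append({"clip_id": s["clip_id"], "critique": critique})
    return labeled
```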
Abstract
Feedback is essential for learning a new skill or improving one's current skill level. However, current methods for skill assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching -- expert commentary, expert video retrieval, and expert pose generation -- outperforming strong vision-language models on both established metrics and human preference studies. Code and data will be publicly released.
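As a rough illustration of the multi-task setup described above, here is a minimal PyTorch-style sketch of a shared encoder over video and pose tokens with separate heads for pose correction and expert-clip retrieval. All dimensions, module names, and the omission of the text-commentary decoder are assumptions made for illustration, not the paper's architecture.

```python
# Minimal sketch (not the paper's model): shared transformer encoder over
# video + 3D-pose tokens, with heads for pose correction and retrieval.
# The free-form commentary head (a language-model decoder conditioned on
# the same features) is omitted for brevity.
import torch
import torch.nn as nn


class FeedbackModel(nn.Module):
    def __init__(self, feat_dim=512, num_joints=17):
        super().__init__()
        self.video_proj = nn.Linear(2048, feat_dim)          # e.g. per-frame CNN features
        self.pose_proj = nn.Linear(num_joints * 3, feat_dim)  # flattened 3D joints per frame
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)  # corrected pose per frame
        self.retrieval_head = nn.Linear(feat_dim, feat_dim)   # query embedding for expert-clip retrieval

    def forward(self, video_feats, pose):
        # video_feats: (B, T, 2048), pose: (B, T, J, 3)
        B, T, J, _ = pose.shape
        tokens = torch.cat(
            [self.video_proj(video_feats), self.pose_proj(pose.reshape(B, T, J * 3))],
            dim=1,
        )                                                     # (B, 2T, feat_dim)
        z = self.encoder(tokens)
        z_pose = z[:, T:, :]                                  # pose-aligned tokens
        return {
            "corrected_pose": self.pose_head(z_pose).reshape(B, T, J, 3),
            "retrieval_query": self.retrieval_head(z.mean(dim=1)),
        }


if __name__ == "__main__":
    model = FeedbackModel()
    out = model(torch.randn(2, 64, 2048), torch.randn(2, 64, 17, 3))
    print(out["corrected_pose"].shape, out["retrieval_query"].shape)
```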