Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

📅 2026-05-05

🤖 AI Summary
Accurately assessing how well an action is executed, rather than merely recognizing its category, is critical for coaching, rehabilitation, and talent selection. This work discusses three multi-view video analysis methods that advance skill evaluation beyond closed-set classification toward generative expert feedback: SkillFormer, a parameter-efficient architecture for selective multi-view fusion; PATS, a temporal sampling strategy that preserves locally dense excerpts of fundamental movements; and ProfVLM, a conditional language generation model that uses a gated cross-view projector and a compact language backbone to output both a proficiency label and expert-style feedback. Evaluated on Ego-Exo4D, these methods achieve state-of-the-art accuracy with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines.
📝 Abstract
Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
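To make the idea of "locally dense excerpts" concrete, the following minimal sketch contrasts ordinary uniform frame sampling with a sampler that spends the same frame budget on a few short, contiguous excerpts. It is an illustration under stated assumptions, not the PATS algorithm itself: the function names, the choice to centre each excerpt in an equal-width segment of the clip, and the `stride` parameter are all hypothetical and do not come from the paper.

```python
from typing import List


def uniform_indices(num_frames: int, num_samples: int) -> List[int]:
    """Evenly spaced frame indices across the whole clip (a common baseline)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    step = num_frames / num_samples
    # Take the centre of each of the num_samples equal-width bins.
    return [min(num_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]


def locally_dense_indices(
    num_frames: int,
    num_segments: int,
    frames_per_segment: int,
    stride: int = 1,
) -> List[int]:
    """Hypothetical sampler: one short, densely sampled excerpt per segment.

    The clip is split into num_segments equal-width segments (an assumption),
    and each excerpt of frames_per_segment frames, spaced by stride, is
    centred in its segment.
    """
    if min(num_frames, num_segments, frames_per_segment, stride) <= 0:
        return []
    span = (frames_per_segment - 1) * stride + 1  # frames covered by one excerpt
    window = num_frames / num_segments
    indices: List[int] = []
    for s in range(num_segments):
        centre = int(window * (s + 0.5))
        # Shift the excerpt if needed so that it stays inside the clip.
        start = max(0, min(centre - span // 2, num_frames - span))
        for k in range(frames_per_segment):
            indices.append(min(num_frames - 1, max(0, start + k * stride)))
    return indices


if __name__ == "__main__":
    # Same budget of 12 frames from a 300-frame clip, distributed two ways.
    print(uniform_indices(300, 12))
    print(locally_dense_indices(300, num_segments=3, frames_per_segment=4))
```

With the same 12-frame budget, the uniform sampler places frames roughly 25 frames apart, whereas the dense sampler returns three runs of four consecutive frames. The second layout keeps the frame-to-frame motion inside each excerpt, which is the kind of short temporal event the abstract identifies as carrying proficiency cues.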
Problem

Research questions and friction points this paper is trying to address.

proficiency estimation
multi-view
action assessment
fine-grained analysis
video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-efficient learning
multi-view fusion
proficiency estimation
generative feedback
temporal sampling