🤖 AI Summary
Existing skill assessment methods predominantly rely on black-box video classifiers, which neglect multi-view context and lack interpretability. This paper proposes a generative vision-language reasoning framework that treats skill proficiency assessment as joint skill-level prediction and natural language feedback generation. It introduces a lightweight AttentiveGatedProjector module that dynamically fuses egocentric and exocentric video features and projects them into the language model's input space. Spatiotemporal features are extracted with a frozen TimeSformer backbone, while a language model fine-tuned for feedback generation produces expert-like qualitative critiques. Evaluated on the EgoExo4D benchmark, the method uses up to 20× fewer parameters and reduces training time by up to 60%, while surpassing state-of-the-art methods in accuracy. It also produces natural language feedback aligned with the observed performance, connecting quantitative assessment with qualitative, human-interpretable reasoning.
📝 Abstract
Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
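As a rough illustration of what "dynamically fusing multi-view features" could look like, the sketch below weights per-view feature vectors with attention, then applies a sigmoid gate to a linear projection of the fused vector. It is not the paper's AttentiveGatedProjector: the abstract does not describe the module's internals, so the function `fuse_views`, the learned `query`, the `proj` and `gate_proj` matrices, and all dimensions are assumptions made for illustration, and random numbers stand in for TimeSformer features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (hypothetical, randomly initialised)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def fuse_views(view_feats, query, proj, gate_proj):
    """Hypothetical attentive gated fusion of per-view feature vectors.

    view_feats: list of feature vectors, one per camera view (e.g. ego, exo)
    query:      assumed learned query vector, same length as each feature
    proj:       assumed projection matrix (lm_dim x feat_dim)
    gate_proj:  assumed gate matrix (lm_dim x feat_dim)
    Returns (gated projection of length lm_dim, per-view attention weights).
    """
    dim = len(query)
    # Scaled dot-product score of each view against the query.
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in view_feats
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the views.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, view_feats))
        for i in range(dim)
    ]
    # The gate decides, per output dimension, how much of the projection passes.
    projected = matvec(proj, fused)
    gate = [sigmoid(g) for g in matvec(gate_proj, fused)]
    return [g * p for g, p in zip(gate, projected)], weights


if __name__ == "__main__":
    rng = random.Random(0)
    feat_dim, lm_dim = 8, 4  # toy sizes, chosen only for the example
    ego = [rng.uniform(-1, 1) for _ in range(feat_dim)]
    exo = [rng.uniform(-1, 1) for _ in range(feat_dim)]
    query = [rng.uniform(-1, 1) for _ in range(feat_dim)]
    token, weights = fuse_views(
        [ego, exo],
        query,
        random_matrix(lm_dim, feat_dim, rng),
        random_matrix(lm_dim, feat_dim, rng),
    )
    print("view weights:", weights)
    print("fused token:", token)
```

In this sketch the attention weights indicate how much each view contributes to the fused representation, which is one plausible way a projector could adapt to whichever camera shows the relevant action more clearly.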