SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

📅 2025-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Assessing fine-grained human skill proficiency in complex activities (e.g., sports, rehabilitation) from multi-view video, particularly when combining egocentric and exocentric perspectives, remains challenging. To address this, we propose SkillFormer, a unified proficiency estimation framework for multi-view video. Our method introduces CrossViewFusion, a module that combines multi-head cross-view attention, learnable gating, and adaptive self-calibration to fuse complementary spatiotemporal cues across viewpoints. We integrate this module into the TimeSformer architecture and use LoRA for parameter-efficient fine-tuning. On the EgoExo4D benchmark, our approach reaches state-of-the-art accuracy in multi-view settings while using 4.5× fewer parameters and 3.75× fewer training epochs than prior baselines. The result is a marked gain in computational efficiency, along with strong results across multiple structured tasks.

📝 Abstract

Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. When evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.
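The fusion idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' CrossViewFusion module: it uses a single attention head instead of several, feeds raw features in as queries, keys, and values with no learned projections, and leaves out the self-calibration step because the abstract does not specify it. The function names, the per-dimension gate, and the toy feature vectors are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_view_attention(views):
    """Single-head scaled dot-product attention across views.

    views: list of feature vectors (lists of floats), one per camera.
    Assumption: queries, keys, and values are the raw features.
    """
    dim = len(views[0])
    attended = []
    for query in views:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in views
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, views))
             for d in range(dim)]
        )
    return attended

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def gated_fusion(views, gate_weights, gate_bias):
    """Blend attended and raw view features with a per-dimension sigmoid gate.

    gate_weights and gate_bias stand in for learnable parameters (hypothetical).
    """
    raw = mean_pool(views)
    mixed = mean_pool(cross_view_attention(views))
    fused = []
    for r, m, w, b in zip(raw, mixed, gate_weights, gate_bias):
        gate = 1.0 / (1.0 + math.exp(-(w * r + b)))
        fused.append(gate * m + (1.0 - gate) * r)
    return fused

# Toy features: one egocentric and two exocentric views, four dimensions each.
ego = [1.0, 0.0, 0.5, -0.5]
exo_a = [0.8, 0.1, 0.4, -0.3]
exo_b = [-0.2, 0.9, 0.0, 0.3]
fused = gated_fusion([ego, exo_a, exo_b],
                     gate_weights=[0.5] * 4, gate_bias=[0.0] * 4)
print(fused)
```

Because both the attention weights and the gate are convex weights, each fused value stays within the range spanned by the input views in that dimension. A trained model would learn the projections and gate parameters rather than fix them by hand.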

Problem

Research questions and friction points this paper is trying to address.

Estimating human skill levels in complex activities

Unified proficiency estimation from multi-view videos

Reducing training costs with parameter-efficient architecture

Innovation

Methods, ideas, or system contributions that make the work stand out.

CrossViewFusion module for multi-view feature fusion

Low-Rank Adaptation for parameter-efficient fine-tuning

Unified architecture for egocentric and exocentric videos
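To make the parameter-efficiency point concrete, here is a minimal sketch of how Low-Rank Adaptation shrinks the trainable parameter count of a single weight matrix. The frozen weight W is left untouched, and only two small matrices A and B are trained, so the effective weight becomes W + (alpha / r) · B · A. The 768-wide layer, the rank of 8, and the helper names are illustrative assumptions, not values taken from the paper, and the per-matrix ratio printed here is unrelated to the 4.5x whole-model figure reported above.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in entries of W.
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in) instead.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

def matvec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x), where r is the rank of A."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Illustrative sizes (assumed): a 768 x 768 projection adapted with rank 8.
full, lora = lora_param_counts(d_out=768, d_in=768, rank=8)
print(full, lora, full / lora)  # 589824 12288 48.0

# With B initialised to zeros, the adapted layer matches the frozen one.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
print(lora_forward([2.0, 3.0], W, A, B=[[0.0], [0.0]], alpha=2.0))  # [2.0, 3.0]
```

Starting B at zero is the usual LoRA convention: training begins from the pretrained model's behaviour and only gradually departs from it as A and B are updated.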
