🤖 AI Summary
To address the high power consumption of continuous video processing for first-person vision (FPV) skill assessment on smart glasses, this paper proposes a two-stage framework: joint gaze–video modeling followed by knowledge distillation into a gaze-only student model. The authors systematically validate, for the first time, the discriminative value of gaze signals for skill understanding across diverse domains (cooking, music, sports), and design a lightweight student model that accepts only gaze input. By eliminating real-time video encoding and analysis, the distilled student maintains high accuracy while using 73× less power than competing methods, and the full gaze–video teacher achieves state-of-the-art accuracy, significantly improving deployability on resource-constrained edge devices. The core contributions are: (1) establishing a gaze-centric paradigm for FPV skill understanding; and (2) enabling efficient multimodal-to-unimodal knowledge transfer.
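The two-stage design maps onto a standard teacher–student distillation setup. Below is a minimal PyTorch sketch of what such a pipeline could look like; the module names, the GRU gaze encoder, the concatenation-based fusion, the feature dimensions, the three-way skill labels, and the Hinton-style KD loss are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a two-stage gaze-video teacher / gaze-only student
# framework. All names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeEncoder(nn.Module):
    """Encodes a gaze sequence (x, y over time) into a feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=feat_dim, batch_first=True)

    def forward(self, gaze):          # gaze: (B, T, 2)
        _, h = self.rnn(gaze)         # h: (1, B, feat_dim)
        return h.squeeze(0)           # (B, feat_dim)

class Teacher(nn.Module):
    """Stage 1: joint gaze + video model. `video_feat` stands in for the
    output of any off-the-shelf egocentric video backbone."""
    def __init__(self, video_feat_dim=512, feat_dim=128, num_classes=3):
        super().__init__()
        self.gaze_enc = GazeEncoder(feat_dim)
        self.fuse = nn.Linear(video_feat_dim + feat_dim, feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, video_feat, gaze):
        z = torch.cat([video_feat, self.gaze_enc(gaze)], dim=-1)
        return self.head(F.relu(self.fuse(z)))

class Student(nn.Module):
    """Stage 2: gaze-only model distilled from the frozen teacher."""
    def __init__(self, feat_dim=128, num_classes=3):
        super().__init__()
        self.gaze_enc = GazeEncoder(feat_dim)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, gaze):
        return self.head(self.gaze_enc(gaze))

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: cross-entropy on ground-truth labels plus
    KL divergence between temperature-softened teacher/student outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kd
```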
📝 Abstract
Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
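The claimed power saving comes from dropping the video path entirely at inference time. A minimal deployment sketch, continuing the assumed `Student` module above (the window length and 30 Hz sampling rate are illustrative, not from the paper):

```python
# Hypothetical inference: only the gaze stream is processed;
# no video frames are ever captured or encoded.
import torch

student = Student(num_classes=3).eval()   # distilled in stage 2

gaze_window = torch.randn(1, 300, 2)      # e.g., 10 s of gaze at 30 Hz: (B, T, xy)
with torch.no_grad():
    skill_logits = student(gaze_window)
print(skill_logits.argmax(dim=-1))        # predicted skill level for the window
```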