EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

📅 2026-05-08

🤖 AI Summary

Predicting future 3D hand poses from first-person video is challenging because of complex intentions, dexterous motions, and severe viewpoint changes caused by ego-motion. This work proposes EggHand, a framework that introduces multimodal foundation models into egocentric hand pose forecasting for the first time. EggHand integrates a large-scale pretrained video-text encoder with a vision-language-action (VLA) decoder to jointly model hand dynamics, contextual semantics, and high-level intent, without relying on body pose priors or external trackers. The approach enables viewpoint-aware semantic reasoning and language-guided controllable prediction, achieving state-of-the-art performance on EgoExo4D and remaining robust under aggressive ego-motion.

📝 Abstract

Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, the task remains highly challenging because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand
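
To make "forecasting accuracy" concrete, the following is a minimal sketch of how a forecast 3D hand pose sequence might be scored against ground truth. The abstract does not name its metric, so the mean per-joint Euclidean error used here is an assumption, and the type aliases and the function name `mean_joint_error` are hypothetical names chosen for illustration. They are not taken from the EggHand code.

```python
import math
from typing import List, Tuple

# Hypothetical types for illustration only (not from the paper).
Joint = Tuple[float, float, float]   # one 3D joint position
Pose = List[Joint]                   # one hand pose: a list of joints
PoseSequence = List[Pose]            # a forecast horizon: one pose per future frame


def mean_joint_error(pred: PoseSequence, target: PoseSequence) -> float:
    """Average Euclidean distance between predicted and true joints.

    Assumed metric: the abstract only says "forecasting accuracy".
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must cover the same number of frames")

    total = 0.0
    count = 0
    for pred_pose, target_pose in zip(pred, target):
        if len(pred_pose) != len(target_pose):
            raise ValueError("each frame must have the same number of joints")
        for p, t in zip(pred_pose, target_pose):
            total += math.dist(p, t)  # Euclidean distance in 3D
            count += 1

    if count == 0:
        raise ValueError("sequences must contain at least one joint")
    return total / count


# Toy example: one future frame with two joints, off by 3 and 4 units.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
target = [[(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]]
error = mean_joint_error(pred, target)
```

Averaging over every joint and every future frame gives a single number per sequence; a real evaluation would likely also report error per forecast horizon, which this sketch does not attempt.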

Problem

Research questions and friction points this paper is trying to address.

egocentric vision

hand pose forecasting

3D hand motion

human intention understanding

embodied AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric hand pose forecasting

multimodal foundation model

vision-language-action (VLA)