EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

📅 2026-03-04
🤖 AI Summary
This work addresses the core challenges of first-person human motion estimation (limited body visibility, frequent occlusions, and scarce annotated data) with a temporally consistent, spatially grounded Transformer architecture. The method combines identity-conditioned queries, multi-view spatial refinement, and causal temporal attention with an uncertainty-aware semi-supervised auto-labeling strategy that exploits large-scale unlabeled data. Through a fully differentiable design, teacher-student pseudo-label generation, uncertainty distillation, and fusion of parametric and keypoint-based body representations, the model achieves state-of-the-art results on EgoBody3M, outperforming two prior methods by 12.2% and 19.4% in accuracy at only 0.8 ms GPU latency, while reducing temporal jitter by 22.2% and 51.7% relative to those baselines. The auto-labeling scheme further improves wrist MPJPE by 13.1%.
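The summary names causal temporal attention as one of the model's components but gives no implementation details. A minimal single-head sketch of the general idea, assuming one feature vector per frame (shapes and names are illustrative, not from the paper):

```python
import numpy as np

def causal_temporal_attention(q, k, v):
    """Single-head attention over a window of T frames where frame t may
    only attend to frames <= t (no future leakage, as required for online
    AR/VR inference). q, k, v: arrays of shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                               # (T, T)
    scores[np.triu(np.ones((T, T), dtype=bool), 1)] = -np.inf   # mask future frames
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                          # row-wise softmax
    return w @ v                                                # (T, d)
```

Because the upper triangle is masked before the softmax, the output at frame t is a convex combination of values from frames 0..t only, which is what makes streaming inference at a fixed per-frame latency possible.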

📝 Abstract
Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, which addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables training on large unlabeled datasets. Our model is fully differentiable; it introduces identity-conditioned queries, multi-view spatial refinement, and causal temporal attention, and supports both keypoint and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training: a teacher-student scheme generates pseudo-labels and guides training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, at 0.8 ms GPU latency, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy and reduces temporal jitter by 22.2% and 51.7%. Our auto-labeling system further improves wrist MPJPE by 13.1%.
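The abstract describes uncertainty-aware semi-supervised training in which a teacher produces pseudo-labels that weight the student's loss, but the exact formulation is not given. A hypothetical sketch of one common variant, assuming the teacher outputs a per-joint standard deviation (the threshold, function names, and inverse-variance weighting are all assumptions, not the paper's method):

```python
import numpy as np

def select_pseudo_labels(teacher_pose, teacher_sigma, max_sigma=0.05):
    """Keep only joints whose predicted uncertainty is below a threshold.
    teacher_pose: (J, 3) joint positions; teacher_sigma: (J,) per-joint stds."""
    keep = teacher_sigma < max_sigma
    return teacher_pose[keep], keep

def uncertainty_weighted_loss(student_pose, pseudo_pose, sigma, keep):
    """Inverse-variance-weighted squared error over the kept joints, so the
    student trusts confident teacher predictions more than uncertain ones."""
    w = 1.0 / (sigma[keep] ** 2 + 1e-6)
    err = np.sum((student_pose[keep] - pseudo_pose) ** 2, axis=-1)
    return float(np.mean(w * err))
```

Filtering by predicted uncertainty keeps noisy pseudo-labels (e.g. heavily occluded wrists) from dominating the student's gradients, which is one plausible reading of how the auto-labeling system improves wrist accuracy.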
Problem

Research questions and friction points this paper is trying to address.

egocentric human motion estimation
AR/VR
occlusions
limited body coverage
scarce labeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Egocentric Pose Estimation
Transformer-based Model
Auto-labeling System
Uncertainty-aware Semi-supervised Learning
Causal Temporal Attention