Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context

📅 2025-12-22
🤖 AI Summary
Estimating full-body 3D motion from egocentric video remains challenging due to severe self-occlusions, intermittent hand visibility, and unreliable head-trajectory estimation. To address these issues, we propose HaMoS, the first hand-aware sequential diffusion model tailored to egocentric settings. HaMoS introduces realistic field-of-view (FOV) constraints and occlusion-aware data augmentation; couples a local attention mechanism with multimodal conditioning that integrates head motion trajectories and sparse hand observations; and leverages sequence-level priors (e.g., body shape and FOV) for robust long-horizon inference. On public benchmarks, HaMoS achieves state-of-the-art accuracy and temporal smoothness, significantly improving the reliability and generalization of egocentric 3D human motion reconstruction in unconstrained, in-the-wild scenarios.

📝 Abstract
Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer's full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.
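The abstract's augmentation idea, masking hand observations that fall outside the camera's field of view or are randomly occluded, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function name, the cone-based FOV test, and the occlusion probability are all assumptions made for the example.

```python
import numpy as np

def mask_hands_by_fov(hand_pos_cam, fov_deg=90.0, occlusion_p=0.2, rng=None):
    """Simulate intermittent hand visibility for egocentric capture.

    hand_pos_cam: (T, 2, 3) hand positions in the camera frame
    (+z = optical axis). A hand counts as "visible" only if it lies
    inside a symmetric FOV cone and survives a random occlusion drop.
    Returns masked observations plus a per-frame visibility flag.
    Illustrative sketch only; not the paper's actual augmentation.
    """
    rng = rng or np.random.default_rng(0)
    T, H, _ = hand_pos_cam.shape
    half = np.deg2rad(fov_deg) / 2.0
    # Angle between the optical axis and each hand direction.
    norms = np.linalg.norm(hand_pos_cam, axis=-1)
    cos_angle = hand_pos_cam[..., 2] / np.maximum(norms, 1e-8)
    in_fov = cos_angle > np.cos(half)
    # Random occlusions stand in for hands hidden by objects or the body.
    occluded = rng.random((T, H)) < occlusion_p
    visible = in_fov & ~occluded
    # Zero out unobserved hands; the model sees the visibility flag too.
    masked = np.where(visible[..., None], hand_pos_cam, 0.0)
    return masked, visible
```

A model conditioned on such masked cues learns to exploit hands when they appear and to fall back on head trajectory when they do not.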
Problem

Research questions and friction points this paper is trying to address.

Estimating the wearer's full-body motion from first-person video, where most body parts are out of view
Head-trajectory-only conditioning is ambiguous, while assuming continuously tracked hands is unrealistic for lightweight devices
No existing datasets pair diverse camera views with human motion under real-world FOV limits and occlusions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hand-aware, sequence-level diffusion framework conditioned on head trajectory and intermittently visible hand cues
Novel augmentation method models real-world field-of-view limits and occlusions
Sequence-level context (body shape, FOV) and local attention enable efficient long-sequence inference
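The local attention mentioned above can be illustrated with a banded mask: each frame attends only to a fixed temporal window around it, so cost grows linearly with sequence length. This is a generic sketch of windowed attention, under assumptions of my own; the paper's actual window size and attention layout are not specified here.

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """Boolean (seq_len, seq_len) mask where frame i may attend to
    frame j only if |i - j| <= window. `window` is an illustrative
    parameter, not a value from the paper.
    """
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

Applied to attention logits (setting masked positions to -inf before the softmax), this keeps per-frame context local while sequence-level priors such as body shape supply the global information.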