HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the inherent ambiguity in full-body human motion reconstruction from a single head-mounted device (forward-facing RGB camera + visual SLAM), this paper proposes a reconstruction-generation co-design framework. First, it fuses SLAM point clouds, estimated head motion, and image embeddings to construct a multimodal conditioning signal. Second, it introduces a Transformer-based conditional motion diffusion model augmented with an autoregressive motion inpainting mechanism, enabling low-latency online inference (0.17 s). The method innovatively unifies geometric priors with generative modeling—leveraging both explicit 3D structure and implicit motion statistics. Evaluated on over 200 hours of diverse indoor/outdoor motion data, it significantly improves spatiotemporal coherence, physical plausibility, and environmental interaction fidelity compared to prior approaches.
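The condition-fusion step described above can be sketched in a few lines. This is an illustrative sketch only: the feature dimensions, the per-frame pooling of SLAM points, and the normalization are assumptions for the example, not the paper's actual layout.

```python
import numpy as np

def fuse_conditions(head_motion, point_feats, image_embed):
    """Fuse per-frame multimodal signals into one conditioning token per frame.

    head_motion : (T, 6)   head translation + orientation per frame (assumed layout)
    point_feats : (T, 64)  pooled SLAM point-cloud features (assumed dim)
    image_embed : (T, 512) per-frame image embeddings (assumed dim)
    Returns     : (T, 582) conditioning tokens for the diffusion model.
    """
    assert head_motion.shape[0] == point_feats.shape[0] == image_embed.shape[0]
    cond = np.concatenate([head_motion, point_feats, image_embed], axis=-1)
    # Standardize per feature so no single modality dominates by scale alone.
    cond = (cond - cond.mean(axis=0, keepdims=True)) / (cond.std(axis=0, keepdims=True) + 1e-6)
    return cond

T = 8
tokens = fuse_conditions(np.zeros((T, 6)), np.ones((T, 64)), np.random.randn(T, 512))
print(tokens.shape)  # (8, 582)
```

In the paper these tokens condition a Transformer-based diffusion model; here they are just concatenated and standardized to show the data flow.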

📝 Abstract
This paper investigates the generation of realistic full-body human motion using a single head-mounted device with an outward-facing color camera and the ability to perform visual SLAM. To address the ambiguity of this setup, we present HMD^2, a novel system that balances motion reconstruction and generation. From a reconstruction standpoint, it aims to maximally utilize the camera streams to produce both analytical and learned features, including head motion, SLAM point cloud, and image embeddings. On the generative front, HMD^2 employs a multi-modal conditional motion diffusion model with a Transformer backbone to maintain temporal coherence of generated motions, and utilizes autoregressive inpainting to facilitate online motion inference with minimal latency (0.17 seconds). We show that our system provides an effective and robust solution that scales to a diverse dataset of over 200 hours of motion in complex indoor and outdoor environments.
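The autoregressive inpainting idea from the abstract, generating each new motion window while clamping a few overlap frames from the previous window as "known" values, can be sketched as a toy loop. The window length, overlap size, iteration count, and the dummy denoiser below are all placeholders, not the paper's actual diffusion sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def inpaint_window(prev_tail, window_len, denoise):
    """One autoregressive inpainting step (illustrative sketch).

    prev_tail : (overlap, D) tail of the previously generated window,
                re-imposed at every denoising iteration as known frames
                so consecutive windows join smoothly.
    denoise   : stand-in for the conditional diffusion denoiser.
    """
    overlap, D = prev_tail.shape
    x = rng.standard_normal((window_len, D))   # start the new window from noise
    for _ in range(4):                         # a few toy denoising iterations
        x[:overlap] = prev_tail                # clamp the known (inpainted) frames
        x = denoise(x)
    x[:overlap] = prev_tail
    return x

# Stream three windows of 16 frames, carrying a 4-frame overlap forward.
overlap, window_len, D = 4, 16, 22
tail = np.zeros((overlap, D))                  # e.g. a rest pose to start
chunks = []
for _ in range(3):
    w = inpaint_window(tail, window_len, lambda x: 0.5 * x)  # dummy denoiser
    chunks.append(w[overlap:])                 # emit only the newly generated frames
    tail = w[-overlap:]
stream = np.concatenate(chunks)
print(stream.shape)  # (36, 22)
```

Because only `window_len - overlap` frames are generated per step, the system can emit motion continuously with bounded per-window latency, which is the mechanism behind the reported 0.17 s online inference.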
Problem

Research questions and friction points this paper is trying to address.

Generating realistic full-body motion from a single head-mounted device
Balancing motion reconstruction against generation using multi-modal features
Maintaining temporal coherence and low latency during online motion inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a single head-mounted device with an outward-facing camera and visual SLAM
Employs a multi-modal conditional motion diffusion model with a Transformer backbone
Achieves online motion inference with minimal latency (0.17 seconds)