EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs

📅 2024-08-30
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 5
Influential: 0
🤖 AI Summary
Monomodal methods for egocentric human pose estimation (HPE) suffer from image self-occlusion (vision) or drift and sparsity (inertial sensing), while the absence of synchronized multimodal datasets captured in realistic VR settings hinders progress. Method: We introduce EMHI, the first multimodal motion dataset tailored for VR products, featuring synchronized HMD stereo vision and full-body IMU recordings: 885 sequences from 58 subjects, 39 action classes, and SMPL annotations validated against optical marker-based motion capture. Building upon EMHI, we propose MEPoser, an end-to-end multimodal baseline that fuses visual and inertial features, encodes them over time, and regresses SMPL parameters with MLP heads. Contribution/Results: Experiments demonstrate that MEPoser significantly outperforms single-modal baselines on EMHI, validating that multimodal fusion is both effective and essential for improving robustness and accuracy in egocentric HPE.
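
The summary names MEPoser's three components (a multimodal fusion encoder, a temporal feature encoder, and MLP-based regression heads) without detailing them. The PyTorch sketch below shows one plausible shape for such a pipeline; the ResNet-18 image backbone, the LSTM temporal encoder, the layer widths, the IMU input dimensions, and the 6D-rotation pose output are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MEPoserSketch(nn.Module):
    """Illustrative fusion model: stereo image features + IMU features ->
    temporal encoding -> SMPL parameter regression. Not the paper's code."""

    def __init__(self, num_imus=4, imu_dim=12, feat_dim=256, pose_dim=144):
        super().__init__()
        # Image branch: one CNN backbone shared by both stereo views (assumed).
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # yields 512-d features per view
        self.image_encoder = backbone
        # IMU branch: per-frame MLP over flattened sensor readings (assumed
        # imu_dim values per sensor, e.g. orientation + acceleration).
        self.imu_encoder = nn.Sequential(
            nn.Linear(num_imus * imu_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Multimodal fusion encoder: concatenate stereo + IMU features.
        self.fusion = nn.Sequential(
            nn.Linear(2 * 512 + feat_dim, feat_dim), nn.ReLU(),
        )
        # Temporal feature encoder: LSTM over fused per-frame features.
        self.temporal = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        # MLP-based regression heads (pose as 24 joints x 6D rotation here).
        self.pose_head = nn.Linear(feat_dim, pose_dim)
        self.shape_head = nn.Linear(feat_dim, 10)  # SMPL shape (betas)

    def forward(self, left_imgs, right_imgs, imu_seq):
        # left_imgs, right_imgs: (B, T, 3, H, W); imu_seq: (B, T, num_imus*imu_dim)
        B, T = imu_seq.shape[:2]
        flatten = lambda x: x.reshape(B * T, *x.shape[2:])
        f_left = self.image_encoder(flatten(left_imgs)).reshape(B, T, -1)
        f_right = self.image_encoder(flatten(right_imgs)).reshape(B, T, -1)
        f_imu = self.imu_encoder(imu_seq)
        fused = self.fusion(torch.cat([f_left, f_right, f_imu], dim=-1))
        temporal_feats, _ = self.temporal(fused)
        return self.pose_head(temporal_feats), self.shape_head(temporal_feats)
```

Fusing per-frame stereo and IMU features before the temporal encoder lets the sequence model smooth estimates across frames, the usual remedy for per-frame jitter and momentary self-occlusion.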

📝 Abstract
Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome this barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected using a real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. This dataset consists of 885 sequences captured from 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, a temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. We believe the release of EMHI and the method could advance research on egocentric HPE and expedite the practical implementation of this technology in VR/AR products.
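
As a concrete reading of "evaluate the annotations by comparing them with optical marker-based SMPL fitting results": a standard way to quantify agreement between two sets of SMPL-derived joints is mean per-joint position error (MPJPE). The paper's exact validation metric is not stated here, so the helper below is illustrative.

```python
import numpy as np

def mpjpe(pred_joints: np.ndarray, ref_joints: np.ndarray) -> float:
    """Mean per-joint position error between two (T, J, 3) joint arrays,
    in whatever units the inputs use (typically millimeters)."""
    return float(np.linalg.norm(pred_joints - ref_joints, axis=-1).mean())
```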
Problem

Research questions and friction points this paper is trying to address.

Lack of real-world multimodal datasets combining egocentric images and IMU data
Inaccurate human pose estimation caused by image self-occlusion or by IMU sparsity and drift
Need for synchronized headset camera and body-worn sensor data for VR/AR applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion of headset cameras and body-worn IMUs
Real-world dataset with synchronized stereo images and inertial data (a hypothetical per-frame layout is sketched after this list)
Baseline method combining temporal encoding with SMPL pose regression
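
To make "synchronized stereo images and inertial data" concrete, here is a hypothetical per-frame record for an EMHI-style sample. The field names, IMU count, and array shapes are assumptions for illustration and may not match the released data layout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EMHIFrame:
    """One time-synchronized sample; all modalities share `timestamp`."""
    timestamp: float            # common clock across cameras and IMUs
    left_image: np.ndarray      # (H, W, 3), downward-sloping HMD camera
    right_image: np.ndarray     # (H, W, 3), second view of the stereo pair
    imu_readings: np.ndarray    # (num_imus, C), one row per body-worn sensor
    smpl_pose: np.ndarray       # (72,) axis-angle SMPL body pose annotation
    smpl_shape: np.ndarray      # (10,) SMPL shape (beta) parameters
```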
👥 Authors
Zhen Fan, Peng Dai, Zhuo Su, Xu Gao, Zheng Lv, Jiarui Zhang, Tianyuan Du, Guidong Wang, Yang Zhang