EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of capturing scene-aware human motion in natural environments, where existing motion capture systems are hindered by their reliance on expensive equipment and wearable sensors. The authors propose a portable data acquisition pipeline using two handheld iPhones, enabling markerless 4D human-scene reconstruction in the wild. By jointly calibrating dual-view RGB-D sequences into a unified metric coordinate system, their method achieves metrically consistent human-scene reconstruction under low-cost, unconstrained conditions, surmounting the hardware and environmental limitations of conventional motion capture. Experimental results demonstrate superior reconstruction accuracy compared to monocular or single-iPhone approaches and validate the pipeline's effectiveness across three embodied AI tasks: monocular human-scene reconstruction, physics-driven animation synthesis, and humanoid robot motion control.

📝 Abstract
Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate the dual RGB-D sequences so that both humans and scenes are reconstructed within a unified metric world coordinate frame. The proposed method allows metric-scale, scene-consistent capture in everyday environments without static cameras or markers, seamlessly bridging human motion and scene geometry. Evaluated against optical motion-capture ground truth, the dual-view setting markedly mitigates depth ambiguity, achieving superior alignment and reconstruction performance over single-iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space-aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
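At the core of the dual-view calibration idea is registering two RGB-D views into one metric world frame, which at its simplest reduces to estimating the rigid transform between corresponding 3D points seen by both cameras. The sketch below (a minimal illustration, not the paper's actual joint-calibration pipeline; the function name `rigid_align` and the closed-form Kabsch/SVD solver are assumptions on my part) shows how such a transform can be recovered:

```python
import numpy as np

def rigid_align(src, dst):
    """Estimate rotation R and translation t such that dst ≈ src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points from two views.
    Uses the closed-form Kabsch/SVD solution to the least-squares
    rigid-registration problem.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the recovered rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In practice the correspondences would come from matched visual features back-projected through each iPhone's depth map, and the paper calibrates entire moving-camera sequences jointly rather than a single frame pair; with metric depth, this alignment fixes the shared scale as well as the pose.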
Problem

Research questions and friction points this paper is trying to address.

embodied agents
human-scene reconstruction
in-the-wild motion capture
scene-conditioned human motion
4D reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D human-scene reconstruction
dual-view RGB-D calibration
embodied AI
in-the-wild motion capture
metric-scale alignment