World-Coordinate Human Motion Retargeting via SAM 3D Body

📅 2025-12-25
🤖 AI Summary
This work addresses monocular video-to-humanoid-robot motion transfer: reconstructing human motion directly in world coordinates and retargeting it to a robot. The method uses SAM 3D Body as a frozen perception backbone and Momentum HumanRig (MHR) as a robot-friendly intermediate representation. It locks each tracked subject's identity and skeleton-scale parameters, smooths per-frame predictions with sliding-window optimization in the low-dimensional MHR latent space, and recovers global root trajectories with a differentiable soft foot-ground contact model and contact-aware global optimization, avoiding SLAM pipelines and heavy temporal models. The reconstructed motion is then retargeted to the Unitree G1 humanoid through a kinematics-aware two-stage inverse kinematics pipeline. On real monocular videos, the system yields stable world-coordinate trajectories and reliable robot retargeting.

📝 Abstract
Recovering world-coordinate human motion from monocular video and retargeting it to humanoid robots is important for embodied intelligence and robotics. To avoid complex SLAM pipelines and heavy temporal models, we propose a lightweight, engineering-oriented framework that uses SAM 3D Body (3DB) as a frozen perception backbone and the Momentum HumanRig (MHR) representation as a robot-friendly intermediate. Our method (i) locks the identity and skeleton-scale parameters of each tracked subject to enforce temporally consistent bone lengths, (ii) smooths per-frame predictions via efficient sliding-window optimization in the low-dimensional MHR latent space, and (iii) recovers physically plausible global root trajectories with a differentiable soft foot-ground contact model and contact-aware global optimization. Finally, we retarget the reconstructed motion to the Unitree G1 humanoid using a kinematics-aware two-stage inverse kinematics pipeline. Results on real monocular videos show that our method produces stable world trajectories and reliable robot retargeting, indicating that structured human representations with lightweight physical constraints can yield robot-ready motion from monocular input.
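To make step (ii) concrete, here is a minimal sketch of sliding-window smoothing over per-frame latent vectors. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the objective, so this sketch assumes a simple quadratic one (stay close to the per-frame prediction, penalize frame-to-frame change), and the names `smooth_window`, `sliding_window_smooth`, `window`, `stride`, and `lam` are hypothetical.

```python
from typing import List

Vector = List[float]


def smooth_window(frames: List[Vector], lam: float) -> List[Vector]:
    """Minimize sum_t |x_t - z_t|^2 + lam * sum_t |x_{t+1} - x_t|^2 over one window.

    ASSUMPTION: this quadratic objective is a stand-in; the paper's actual
    objective is not stated in the abstract.
    """
    n = len(frames)
    if n < 2:
        return [list(f) for f in frames]
    dim = len(frames[0])
    # The normal equations are tridiagonal: (I + lam * L) x = z,
    # where L is the Laplacian of a path graph over the frames.
    diag = [1.0 + lam * (1 if t in (0, n - 1) else 2) for t in range(n)]
    off = -lam
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        # Thomas algorithm: forward elimination, then back substitution.
        c = [0.0] * n
        r = [0.0] * n
        c[0] = off / diag[0]
        r[0] = frames[0][d] / diag[0]
        for t in range(1, n):
            denom = diag[t] - off * c[t - 1]
            c[t] = off / denom
            r[t] = (frames[t][d] - off * r[t - 1]) / denom
        out[n - 1][d] = r[n - 1]
        for t in range(n - 2, -1, -1):
            out[t][d] = r[t] - c[t] * out[t + 1][d]
    return out


def sliding_window_smooth(frames: List[Vector], window: int = 5,
                          stride: int = 2, lam: float = 1.0) -> List[Vector]:
    """Smooth overlapping windows independently and average the overlaps."""
    if not 1 <= stride <= window:
        raise ValueError("need 1 <= stride <= window so every frame is covered")
    n = len(frames)
    if n == 0:
        return []
    dim = len(frames[0])
    acc = [[0.0] * dim for _ in range(n)]
    cnt = [0] * n
    start = 0
    while True:
        end = min(start + window, n)
        for i, v in enumerate(smooth_window(frames[start:end], lam)):
            for d in range(dim):
                acc[start + i][d] += v[d]
            cnt[start + i] += 1
        if end == n:
            break
        start += stride
    return [[a / cnt[t] for a in acc[t]] for t in range(n)]
```

Because each window reduces to a small tridiagonal solve per latent dimension, the cost grows linearly with the number of frames, which is the kind of efficiency a low-dimensional latent space makes possible.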
Problem

Research questions and friction points this paper is trying to address.

Recover world-coordinate human motion from monocular videos
Retarget human motion to humanoid robots efficiently
Ensure physically plausible motion with lightweight constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAM 3D Body as frozen perception backbone
Momentum HumanRig as robot-friendly intermediate representation
Differentiable soft foot-ground contact model for global optimization
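As an illustration of what a differentiable soft foot-ground contact term can look like, the sketch below scores contact as a product of two sigmoids (foot near the ground, foot nearly still) and uses that score to weight a penalty on foot sliding and floating. This is an assumed, generic formulation, not the paper's model: the function names, thresholds (`h0`, `v0`), and scales (`h_scale`, `v_scale`) are all hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_contact(height: float, speed: float,
                 h0: float = 0.05, v0: float = 0.2,
                 h_scale: float = 0.01, v_scale: float = 0.05) -> float:
    """Smooth contact score in (0, 1): high when the foot is low and slow.

    ASSUMPTION: thresholds (metres, metres/second) and scales are
    illustrative values, not taken from the paper.
    """
    return sigmoid((h0 - height) / h_scale) * sigmoid((v0 - speed) / v_scale)


def contact_loss(heights: Sequence[float], speeds: Sequence[float]) -> float:
    """Contact-weighted penalty on foot sliding (speed) and floating (height)."""
    total = 0.0
    for h, v in zip(heights, speeds):
        c = soft_contact(h, v)
        total += c * (v * v + h * h)
    return total
```

Unlike a hard threshold, the sigmoid score varies smoothly with foot height and speed, so a gradient-based global optimizer can adjust the root trajectory to reduce sliding without needing discrete contact labels.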
Zhangzheng Tu
Dalian University of Technology
Kailun Su
Shenzhen University
Shaolong Zhu
Harbin Institute of Technology, Shenzhen
Yukun Zheng
Tsinghua University
Information retrieval, machine learning, machine reading comprehension, user behavior modeling