MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for humanoid motion synthesis rely on expensive motion capture data and struggle to effectively couple with environmental geometry, often resulting in physically inconsistent behaviors such as foot sliding or mesh penetration on complex terrains. This work proposes the first framework that jointly reconstructs human motion and 3D scene geometry from monocular video, enabling geometry-aware dynamic locomotion through a motion–terrain coupling learning mechanism. By integrating state-of-the-art 3D vision models, kinematic consistency optimization, and contact-invariant retargeting, the approach extracts high-fidelity motion features from noisy visual inputs without requiring motion capture data. The method demonstrates robust, highly dynamic humanoid locomotion across diverse challenging terrains, establishing the feasibility of training sophisticated physical interaction capabilities using only consumer-grade monocular sensors.

📝 Abstract
Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the necessary geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often suffer from a decoupling of motion and scene, resulting in physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. In this work, we present MeshMimic, an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce an optimization algorithm based on kinematic consistency to extract high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach proves that a low-cost pipeline utilizing only consumer-grade monocular sensors can facilitate the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.
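The abstract mentions an optimization based on kinematic consistency for extracting clean motion from noisy visual reconstructions, but does not spell out the formulation. As a rough illustration of the general idea only (not the paper's method; the function and penalty weights below are hypothetical), one can denoise per-frame joint positions by trading off fidelity to the noisy reconstruction against acceleration smoothness and bone-length constancy:

```python
import numpy as np

def smooth_motion(noisy, bones, lam_acc=5.0, lam_bone=1.0, lr=0.008, n_iters=2000):
    """Denoise joint trajectories `noisy` of shape (T, J, 3) by gradient descent on
    ||x - noisy||^2 + lam_acc * ||accel(x)||^2 + lam_bone * bone-length drift."""
    x = noisy.copy()
    # Target bone lengths: per-bone median parent-child distance over all frames.
    tgt = {(p, c): np.median(np.linalg.norm(noisy[:, c] - noisy[:, p], axis=-1))
           for p, c in bones}
    for _ in range(n_iters):
        g = 2.0 * (x - noisy)                    # data-fidelity gradient
        a = x[:-2] - 2.0 * x[1:-1] + x[2:]       # second differences (acceleration)
        g[:-2] += 2.0 * lam_acc * a              # gradient of the smoothness term
        g[1:-1] -= 4.0 * lam_acc * a
        g[2:] += 2.0 * lam_acc * a
        for (p, c), t in tgt.items():            # keep bone lengths near-constant
            d = x[:, c] - x[:, p]
            l = np.linalg.norm(d, axis=-1, keepdims=True) + 1e-9
            gb = 2.0 * lam_bone * (l - t) * d / l
            g[:, c] += gb
            g[:, p] -= gb
        x -= lr * g
    return x

# Demo: a 2-joint chain translating at constant velocity, corrupted by noise.
np.random.seed(0)
T = 60
gt = np.zeros((T, 2, 3))
gt[:, 0, 0] = 0.1 * np.arange(T)                 # root moves along x
gt[:, 1] = gt[:, 0] + np.array([0.0, 1.0, 0.0])  # child at fixed offset
noisy = gt + 0.05 * np.random.randn(T, 2, 3)
smoothed = smooth_motion(noisy, bones=[(0, 1)])
err_noisy = np.mean(np.linalg.norm(noisy - gt, axis=-1))
err_smooth = np.mean(np.linalg.norm(smoothed - gt, axis=-1))
```

In this toy setting the recovered trajectory is both closer to the ground truth and has more stable bone lengths than the raw noisy input, which is the qualitative behavior a kinematic-consistency objective aims for.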
Problem

Research questions and friction points this paper is trying to address.

humanoid motion control
motion-terrain interaction
3D scene reconstruction
physical consistency
motion capture data
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D scene reconstruction
humanoid motion learning
contact-invariant retargeting
kinematic consistency optimization
embodied intelligence
Qiang Zhang
X-Humanoid
Humanoid Robotics, Embodied AI, Robotics
Jiahao Ma
Australian National University
Computer vision, Multiview detection, Novel view synthesis
Peiran Liu
X-Humanoid; The Hong Kong University of Science and Technology (Guangzhou)
Shuai Shi
X-Humanoid
Zeran Su
X-Humanoid
Zifan Wang
The Hong Kong University of Science and Technology (Guangzhou)
Jingkai Sun
X-Humanoid; The University of Hong Kong
Wei Cui
X-Humanoid
Jialin Yu
X-Humanoid
Gang Han
Professor of Biostatistics, Texas A&M University
Statistics, Biostatistics, Medical research, Computer experiments
Wen Zhao
JSPS International Fellow, UT-Austin Postdoc, KAUST
MEMS, Sensor, Nonlinear Dynamics
Pihai Sun
Harbin Institute of Technology
Kangning Yin
Shanghai Jiao Tong University
robotics, humanoid, embodied ai
Jiaxu Wang
The Chinese University of Hong Kong
Jiahang Cao
The University of Hong Kong
Robot Learning, Generative Models, Cognitive-inspired Models
Lingfeng Zhang
PhD student at Tsinghua University
embodied ai
Hao Cheng
The Hong Kong University of Science and Technology (Guangzhou)
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
vision and language
Yiding Ji
The Hong Kong University of Science and Technology (Guangzhou)
Junwei Liang
Assistant Professor, HKUST (Guangzhou) | CSE, HKUST | Ph.D. @CMU
Computer Vision, Robotics, Embodied AI, Trajectory Prediction
Jian Tang
X-Humanoid
Renjing Xu
HKUST(GZ)
Brain-inspired Computing, Humanoid Computing
Yijie Guo
X-Humanoid