DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current terrain-aware locomotion for humanoid robots faces two key bottlenecks: (1) end-to-end depth-image methods suffer from inefficient training and a significant sim-to-real depth perception gap; (2) elevation-map-based approaches rely on multiple visual sensors and high-precision localization, resulting in high latency and poor robustness. This paper proposes a lightweight terrain-aware locomotion framework that uses only a single depth camera. Its core contributions are: (1) a terrain-aware locomotion policy with a blind backbone, in which pre-trained elevation-map-based perception guides reinforcement learning from minimal visual input; (2) a cross-modal cross-attention Transformer that reconstructs structured terrain representations from noisy depth images; and (3) self-occlusion-aware ray casting with noise modeling for synthetic depth image generation, which substantially suppresses reconstruction error. Evaluated on a full-sized humanoid robot, the framework achieves a >30% reduction in terrain reconstruction error, improved training efficiency, and more robust deployment, eliminating dependence on multi-sensor fusion and precise localization.

📝 Abstract
Recent advancements in legged robot perceptive locomotion have shown promising progress. However, terrain-aware humanoid locomotion remains largely constrained to two paradigms: depth image-based end-to-end learning and elevation map-based methods. The former suffers from limited training efficiency and a significant sim-to-real gap in depth perception, while the latter depends heavily on multiple vision sensors and localization systems, resulting in latency and reduced robustness. To overcome these challenges, we propose a novel framework that tightly integrates three key components: (1) Terrain-Aware Locomotion Policy with a Blind Backbone, which leverages pre-trained elevation map-based perception to guide reinforcement learning with minimal visual input; (2) Multi-Modality Cross-Attention Transformer, which reconstructs structured terrain representations from noisy depth images; (3) Realistic Depth Images Synthetic Method, which employs self-occlusion-aware ray casting and noise-aware modeling to synthesize realistic depth observations, achieving over 30% reduction in terrain reconstruction error. This combination enables efficient policy training with limited data and hardware resources, while preserving critical terrain features essential for generalization. We validate our framework on a full-sized humanoid robot, demonstrating agile and adaptive locomotion across diverse and challenging terrains.
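The third component, realistic depth image synthesis, can be illustrated with a minimal sketch: a common way to make simulated depth images sensor-realistic is to add depth-dependent Gaussian noise plus random pixel dropout (imitating holes from self-occlusion and low-texture regions). The error model and all parameter values below are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def synthesize_noisy_depth(clean_depth, sigma0=0.001, sigma_scale=0.0025,
                           dropout_prob=0.02, max_range=4.0, rng=None):
    """Corrupt a simulated depth image so it resembles real sensor output.

    clean_depth : (H, W) array of rendered depths in meters.
    sigma0, sigma_scale : noise std grows with squared distance, a common
        stereo-camera error model (parameter values are hypothetical).
    dropout_prob : fraction of pixels zeroed out to imitate invalid returns.
    """
    rng = np.random.default_rng() if rng is None else rng
    depth = clean_depth.astype(np.float64).copy()

    # Depth-dependent Gaussian noise: farther pixels are noisier.
    sigma = sigma0 + sigma_scale * depth**2
    depth += rng.normal(0.0, 1.0, size=depth.shape) * sigma

    # Random dropout: invalid pixels encoded as 0, as most depth sensors do.
    holes = rng.random(depth.shape) < dropout_prob
    depth[holes] = 0.0

    # Clip to the sensor's valid range.
    return np.clip(depth, 0.0, max_range)
```

In a full pipeline this corruption would be applied on top of self-occlusion-aware ray casting, so the policy trains on observations that already carry the sensor's failure modes.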
Problem

Research questions and friction points this paper is trying to address.

Improving training efficiency and closing the sim-to-real depth-perception gap in terrain-aware humanoid locomotion
Reducing dependence on multiple vision sensors and localization systems to improve latency and robustness
Efficiently reconstructing structured terrain representations from noisy depth images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Blind backbone policy with minimal visual input
Cross-attention transformer reconstructs terrain from depth
Realistic depth synthesis reduces terrain reconstruction error
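The cross-attention reconstruction above can be sketched in its simplest form: learned terrain-grid queries attend over depth-image patch features via scaled dot-product attention. The single-head formulation and shapes are assumptions for illustration; the paper's actual Transformer (number of heads, layers, and how tokens are decoded to terrain heights) is not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries : (Nq, d) learned terrain-grid query embeddings
    keys, values : (Nk, d) depth-image patch features
    returns : (Nq, d) terrain tokens aggregated from depth features
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Nq, Nk) query/patch similarity
    weights = softmax(scores, axis=-1)       # each query attends over all patches
    return weights @ values                  # weighted sum of depth features
```

Because each terrain query pools information from every depth patch, the reconstruction can fill in regions the camera sees poorly, which is what makes this pairing with noisy single-camera input attractive.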
Authors
Jingkai Sun, Beijing Innovation Center of Humanoid Robotics Co. Ltd.
Gang Han
Pihai Sun, Harbin Institute of Technology
Wen Zhao
Jiahang Cao, The University of Hong Kong
Jiaxu Wang, The Hong Kong University of Science and Technology
Yijie Guo, Beijing Innovation Center of Humanoid Robotics Co. Ltd.
Qiang Zhang, Beijing Innovation Center of Humanoid Robotics Co. Ltd.