🤖 AI Summary
Current terrain-aware locomotion for humanoid robots faces two key bottlenecks: (1) end-to-end depth-image methods suffer from inefficient training and a significant sim-to-real gap in depth perception; (2) elevation-map-based approaches rely on multiple visual sensors and high-precision localization, resulting in high latency and poor robustness. This paper proposes a lightweight terrain-aware locomotion framework that relies on a single depth camera. Its core contributions are: (1) a terrain-aware locomotion policy built on a blind (proprioception-only) backbone, with pre-trained elevation-map-based perception guiding reinforcement learning; (2) a multi-modality cross-attention Transformer that reconstructs structured terrain representations from noisy depth images; and (3) a synthetic depth-image generation method that combines self-occlusion-aware ray casting with noise-aware modeling, cutting terrain reconstruction error by over 30%. Validated on a full-sized humanoid robot, the framework improves training efficiency and deployment robustness while eliminating dependence on multi-sensor fusion and precise localization.
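To make contribution (2) concrete, here is a minimal PyTorch sketch of the cross-attention idea, where the proprioceptive state forms the query and depth-image patch tokens form the keys and values; every module name, dimension, and the 11×11 output grid are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CrossModalTerrainReconstructor(nn.Module):
    """Proprioception queries attend over depth-patch tokens (cross-attention);
    an MLP head then regresses a structured local height map."""

    def __init__(self, proprio_dim=48, token_dim=128, n_heads=4, map_cells=121):
        super().__init__()
        self.query_proj = nn.Linear(proprio_dim, token_dim)   # one query token
        self.cross_attn = nn.MultiheadAttention(token_dim, n_heads,
                                                batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(token_dim),
                                  nn.Linear(token_dim, 256), nn.GELU(),
                                  nn.Linear(256, map_cells))  # e.g. an 11x11 grid

    def forward(self, proprio, depth_tokens):
        # proprio: (B, proprio_dim); depth_tokens: (B, N, token_dim)
        q = self.query_proj(proprio).unsqueeze(1)                  # (B, 1, D)
        fused, _ = self.cross_attn(q, depth_tokens, depth_tokens)  # (B, 1, D)
        return self.head(fused.squeeze(1))                         # (B, map_cells)

# Usage: depth_tokens would come from a small CNN/ViT over the noisy depth image.
model = CrossModalTerrainReconstructor()
height_map = model(torch.randn(2, 48), torch.randn(2, 64, 128))  # (2, 121)
```

The single-query design is one plausible reading of "structured terrain reconstruction"; the paper may instead use one query per map cell or a full Transformer decoder.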
📝 Abstract
Recent advances in perceptive locomotion for legged robots have shown promising progress. However, terrain-aware humanoid locomotion remains largely constrained to two paradigms: depth-image-based end-to-end learning and elevation-map-based methods. The former suffers from limited training efficiency and a significant sim-to-real gap in depth perception, while the latter depends heavily on multiple vision sensors and localization systems, resulting in added latency and reduced robustness. To overcome these challenges, we propose a novel framework that tightly integrates three key components: (1) a Terrain-Aware Locomotion Policy with a Blind Backbone, which leverages pre-trained elevation-map-based perception to guide reinforcement learning with minimal visual input; (2) a Multi-Modality Cross-Attention Transformer, which reconstructs structured terrain representations from noisy depth images; and (3) a Realistic Depth Image Synthesis Method, which employs self-occlusion-aware ray casting and noise-aware modeling to synthesize realistic depth observations, achieving over a 30% reduction in terrain reconstruction error. This combination enables efficient policy training with limited data and hardware resources while preserving the critical terrain features essential for generalization. We validate our framework on a full-sized humanoid robot, demonstrating agile and adaptive locomotion across diverse and challenging terrains.
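As an illustration of component (3), a minimal NumPy sketch of depth synthesis by ray-marching a 2.5D heightfield follows; taking the first terrain hit along each ray makes nearer geometry occlude farther geometry, and a range-dependent noise plus pixel-dropout model mimics real sensor artifacts. The function name, heightfield representation, and all noise parameters are assumptions for illustration, not the paper's pipeline:

```python
import numpy as np

def synth_depth(heightfield, cell, cam_pos, ray_dirs, t_max=5.0, dt=0.02,
                noise_std=0.01, dropout_p=0.02):
    """Ray-march camera rays through a 2.5D heightfield; the first hit along
    a ray gives its depth, so occluded terrain is never observed. Then apply
    range-dependent Gaussian noise and random holes (missing returns)."""
    H, W = ray_dirs.shape[:2]
    depth = np.full((H, W), np.nan)
    ts = np.arange(dt, t_max, dt)                      # sample points along ray
    for i in range(H):
        for j in range(W):
            pts = cam_pos + ts[:, None] * ray_dirs[i, j]
            gx = np.clip((pts[:, 0] / cell).astype(int), 0, heightfield.shape[0] - 1)
            gy = np.clip((pts[:, 1] / cell).astype(int), 0, heightfield.shape[1] - 1)
            below = pts[:, 2] <= heightfield[gx, gy]   # ray dipped under terrain?
            if below.any():
                depth[i, j] = ts[below.argmax()]       # first hit = visible depth
    # noise-aware modeling: error grows with range; stereo sensors drop pixels
    depth += np.random.normal(0.0, noise_std, depth.shape) * np.square(depth)
    depth[np.random.rand(H, W) < dropout_p] = np.nan   # simulated missing returns
    return depth
```

Handling of the robot's own body blocking the camera (the other half of "self-occlusion-aware") would additionally require casting rays against the robot's geometry, which is omitted here.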