GeoLoco: Leveraging 3D Geometric Priors from Visual Foundation Model for Robust RGB-Only Humanoid Locomotion

📅 2026-03-08
🤖 AI Summary
This work addresses the limitations of current perception-driven humanoid robots, which typically rely on depth sensors and fail to exploit the semantic and appearance cues available in monocular RGB images, while end-to-end RGB-based control suffers from poor sample efficiency and sim-to-real transfer challenges. The authors propose GeoLoco, a framework that extracts 3D geometric priors from monocular RGB inputs using a frozen, scale-aware Visual Foundation Model (VFM). It introduces a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's gait phase. Coupled with a dual-head auxiliary learning scheme, the method disentangles texture from geometry, ensuring the high-dimensional latent representation aligns with real-world terrain geometry. Trained exclusively in simulation using only RGB input, GeoLoco achieves zero-shot transfer to the Unitree G1 humanoid robot, demonstrating robust locomotion across diverse complex terrains.
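The summary's core fusion idea, a proprioceptive query attending over frozen-VFM features, can be illustrated with a minimal NumPy sketch. All names, shapes, and the single-query design below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def proprio_query_cross_attention(proprio, vfm_feats, Wq, Wk, Wv, n_heads):
    """Hypothetical sketch: the proprioceptive state (which encodes gait
    phase) forms the attention query, while spatial features from a frozen
    VFM supply the keys and values, so the policy attends to terrain
    regions relevant to the current phase of the gait."""
    d = Wq.shape[1]                       # shared embedding width
    dh = d // n_heads                     # per-head width
    q = proprio @ Wq                      # (1, d)  query from proprioception
    k = vfm_feats @ Wk                    # (N, d)  keys from VFM patch features
    v = vfm_feats @ Wv                    # (N, d)  values from VFM patch features
    heads = []
    for h in range(n_heads):
        qh, kh, vh = (m[:, h * dh:(h + 1) * dh] for m in (q, k, v))
        attn = softmax(qh @ kh.T / np.sqrt(dh))   # (1, N) attention over patches
        heads.append(attn @ vh)                   # (1, dh) fused terrain feature
    return np.concatenate(heads, axis=-1)         # (1, d)
```

The output is a single fused terrain embedding per timestep, which a real implementation would feed into the locomotion policy alongside the raw proprioceptive state.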

📝 Abstract
The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach fundamentally discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naive feature concatenation, we design a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's real-time gait phase. Crucially, to prevent the policy from overfitting to superficial textures, we introduce a dual-head auxiliary learning scheme. This explicit regularization forces the high-dimensional latent space to strictly align with the physical terrain geometry, ensuring robust zero-shot sim-to-real transfer. Trained exclusively in simulation, GeoLoco achieves robust zero-shot transfer to the Unitree G1 humanoid and successfully negotiates challenging terrains.
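The dual-head auxiliary scheme described in the abstract can be sketched as a shared latent feeding two heads, with a geometry-regression loss regularizing the representation. This is a toy stand-in under assumed shapes and a squared-error surrogate for the RL objective, not the paper's actual training loss:

```python
import numpy as np

def dual_head_aux_loss(latent, W_policy, W_geo, action_target, height_target, lam=0.5):
    """Hypothetical dual-head auxiliary scheme: one head predicts actions,
    a second head regresses terrain geometry (e.g. a local heightmap) from
    the same latent, so the latent is pushed toward physical geometry
    rather than superficial texture."""
    action_pred = latent @ W_policy                 # policy head
    height_pred = latent @ W_geo                    # auxiliary geometry head
    policy_loss = np.mean((action_pred - action_target) ** 2)  # MSE stand-in for the RL loss
    geo_loss = np.mean((height_pred - height_target) ** 2)     # geometry regression
    return policy_loss + lam * geo_loss             # lam weights the auxiliary term
```

In simulation the ground-truth heightmap around the robot is available for free, which is what makes such an auxiliary target practical for sim-trained, zero-shot-transferred policies.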
Problem

Research questions and friction points this paper is trying to address.

RGB-only locomotion
sim-to-real transfer
geometric priors
humanoid locomotion
monocular vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Foundation Model
monocular RGB locomotion
3D geometric priors
cross-attention mechanism
sim-to-real transfer
Yufei Liu
College of Intelligence Science and Technology, National University of Defense Technology, China
Xieyuanli Chen
Associate Professor, NUDT, China
Robotics, SLAM, Localization, LiDAR Perception, Robot Learning
Hainan Pan
College of Intelligence Science and Technology, National University of Defense Technology, China
Chenghao Shi
College of Intelligence Science and Technology, National University of Defense Technology, China
Yanjie Chen
College of Intelligence Science and Technology, National University of Defense Technology, China
Kaihong Huang
College of Intelligence Science and Technology, National University of Defense Technology, China
Zhiwen Zeng
College of Intelligence Science and Technology, National University of Defense Technology, China
Huimin Lu
National University of Defense Technology
Robot Vision, Multi-robot Coordination, Robot Soccer, Robot Rescue