🤖 AI Summary
Quadrupedal robots face a perceptual gap between their low, ground-level field of view and human-centered language instructions, alongside poor sim-to-real generalization in vision-and-language navigation (VLN). Method: We propose Ground-level Viewpoint Navigation (GVNav), the first method to identify observation-height disparity as a critical bottleneck for cross-platform VLN generalization. GVNav mitigates multi-view feature conflicts by weighting historical observations to build spatiotemporal context, and strengthens spatial priors through cross-view feature disentanglement and 3D scene-graph transfer, leveraging the HM3D and Gibson connectivity graphs. It enables end-to-end, instruction-driven path prediction on both simulated and real-world quadruped platforms. Contribution/Results: Experiments demonstrate that GVNav significantly improves navigation success rates and cross-scene generalization, effectively addressing the occlusion and vision-language misalignment inherent to ground-level perspectives.
📝 Abstract
Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions in order to make sequential decisions. However, generalization remains a persistent challenge, particularly when dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view, proposing a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments. Our approach leverages weighted historical observations as enriched spatiotemporal context for instruction following, managing feature collisions within map cells by assigning appropriate weights to identical features observed from different viewpoints. This enables low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graphs from the HM3D and Gibson datasets as an extra resource to enhance spatial priors and provide a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that GVNav significantly improves performance in both simulated environments and real-world deployments with quadruped robots.
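To make the weighted-historical-observation idea concrete, here is a minimal sketch (not the paper's implementation) of how repeated observations of the same map cell might be fused with recency-based weights, so that conflicting features from different viewpoints are blended rather than overwritten. The function name `fuse_history` and the exponential `decay` schedule are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: fuse a history of per-cell visual features into one map,
# weighting repeated observations of the same cell by recency so that
# newer (less occluded) views dominate older ones.
# `fuse_history` and `decay` are illustrative assumptions, not the paper's API.

def fuse_history(observations, decay=0.8):
    """observations: list of (cell_id, feature_vector), oldest first.
    Returns {cell_id: fused_feature} using exponentially decayed weights."""
    T = len(observations)
    acc = {}    # cell_id -> weighted feature sum
    norm = {}   # cell_id -> total weight
    for t, (cell, feat) in enumerate(observations):
        w = decay ** (T - 1 - t)  # most recent observation gets weight 1.0
        if cell not in acc:
            acc[cell] = [0.0] * len(feat)
            norm[cell] = 0.0
        for i, f in enumerate(feat):
            acc[cell][i] += w * f
        norm[cell] += w
    return {c: [v / norm[c] for v in acc[c]] for c in acc}

# Two conflicting observations of cell "A": the newer one is weighted more,
# so the fused feature lies closer to the recent view.
fused = fuse_history([("A", [0.0, 0.0]), ("A", [1.0, 1.0])], decay=0.5)
```

With `decay=0.5`, the older observation contributes weight 0.5 and the newer one weight 1.0, so the fused feature for cell "A" is (1.0/1.5) = 2/3 in each dimension; other weighting schemes (e.g., confidence- or height-aware weights) would slot into the same structure.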