🤖 AI Summary
This work addresses the performance degradation in cross-embodiment visual navigation caused by differences in robot morphology and camera configurations. Existing approaches often rely on extensive data or fine-tuning and overlook geometric consistency. To overcome these limitations, the authors propose the CeRLP framework, which recovers metrically accurate depth from monocular images via an offline pre-calibrated scale correction and abstracts visual inputs into a unified virtual LiDAR scan. This geometric unification enables consistent modeling across heterogeneous robots—varying in size, camera parameters, and sensor types—without requiring online fine-tuning. CeRLP significantly enhances cross-platform generalization and obstacle avoidance in both point-to-point and vision-language navigation tasks. Experiments demonstrate its superiority over current methods in both simulated and real-world environments.
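The scale correction mentioned above can be sketched as a simple affine fit: during offline pre-calibration, relative monocular depth is matched against sparse metric references, and the fitted correction is then applied online. The affine (scale + shift) model, the function names, and the calibration setup here are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def calibrate_scale(rel_depth, metric_ref, mask):
    """Offline pre-calibration (assumed affine model): least-squares fit of
    (scale, shift) between relative monocular depth and sparse metric
    references, e.g. measured distances to a calibration target."""
    x = rel_depth[mask]
    y = metric_ref[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return scale, shift

def recover_metric_depth(rel_depth, scale, shift):
    """Online use: apply the pre-calibrated correction, metric = scale * rel + shift."""
    return scale * rel_depth + shift
```

Because the calibration is done once offline per camera, no online fine-tuning is needed at deployment time.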
📝 Abstract
Visual navigation for cross-embodiment robots is challenging: variations in robot and camera configurations can cause navigation tasks to fail. Previous approaches typically rely on collecting massive datasets across different robots, which is highly data-intensive, or on fine-tuning models, which is time-consuming; moreover, both strategies often lack explicit consideration of robot geometry. In this paper, we propose a Cross-embodiment Robot Local Planning (CeRLP) framework for general visual navigation, which abstracts visual information into a unified geometric formulation and generalizes to heterogeneous robots with varying physical dimensions, camera parameters, and camera types. CeRLP introduces a depth-estimation scale-correction method that uses offline pre-calibration to resolve the scale ambiguity of monocular depth estimation, thereby recovering accurate metric depth images. Furthermore, CeRLP designs a visual-to-scan abstraction module that projects varying visual inputs into height-adaptive laser scans, making the policy robust across heterogeneous embodiments. Experiments in simulation demonstrate that CeRLP outperforms baseline methods, validating its robust obstacle-avoidance capability as a local planner. Additionally, extensive real-world experiments verify the effectiveness of CeRLP in tasks such as point-to-point navigation and vision-language navigation, demonstrating its generalization across varying robot and camera configurations.
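The visual-to-scan abstraction can be approximated as below: back-project a metric depth image through a pinhole model, keep only points inside a height band matched to the robot's body, and take the nearest range per azimuth bin to form a virtual 2D laser scan. The pinhole back-projection, the height-band filtering, and all parameter names are assumptions for illustration, not the paper's exact module.

```python
import numpy as np

def depth_to_scan(depth, fx, fy, cx, cy, z_min, z_max,
                  n_beams=360, max_range=10.0):
    """Project a metric depth image into a virtual 2D laser scan.

    depth: (H, W) metric depth in meters; fx, fy, cx, cy: pinhole intrinsics;
    z_min, z_max: height band (meters, camera-relative) matched to the robot body.
    Returns an (n_beams,) array of nearest ranges per azimuth bin.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project pixels: x right, y down, z forward (standard camera frame).
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    height = -y  # up is -y in the camera frame
    valid = (depth > 0) & (height >= z_min) & (height <= z_max)
    # Horizontal azimuth and range of the surviving points.
    az = np.arctan2(x[valid], depth[valid])
    rng = np.hypot(x[valid], depth[valid])
    # Keep the nearest return per azimuth bin (virtual beam).
    scan = np.full(n_beams, max_range)
    bins = ((az + np.pi) / (2 * np.pi) * n_beams).astype(int) % n_beams
    np.minimum.at(scan, bins, np.clip(rng, 0.0, max_range))
    return scan
```

Because the policy then consumes only this scan, robots with different cameras and body heights present the same input format, which is the intuition behind the cross-embodiment robustness claimed above.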