🤖 AI Summary
To address the challenge of real-time language instruction understanding and precise target localization for mobile robots in unknown environments, this paper proposes a vision-language navigation method tailored to such settings. The core innovation is a heuristic vision-language (HVL) spatial reasoning mechanism that jointly performs pixel-level cross-modal feature alignment and exploration-guided heuristic path planning, overcoming the limitations of single-frame image matching. The system deploys a lightweight vision-language encoder and a real-time inference framework on the Jetson Orin NX platform. Evaluated across indoor and outdoor multi-scale complex scenes, it achieves an 86.3% task success rate, 44.15% higher than previous methods, while sustaining a 30 Hz inference frequency. This significantly improves both the real-time performance and robustness of semantic navigation in open, unstructured environments.
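As a rough illustration of the pixel-level cross-modal feature alignment described above, the sketch below computes a per-pixel similarity map between image features and an instruction embedding. The encoder interface, feature dimensions, and variable names are assumptions for illustration only, not the paper's actual model.

```python
# Minimal sketch of pixel-level cross-modal feature alignment, assuming a
# hypothetical lightweight vision-language encoder that yields a per-pixel
# image feature map and a text embedding in a shared space.
import numpy as np

def pixelwise_similarity(pixel_features, text_embedding):
    """Return an (H, W) map of cosine similarity between each pixel feature
    and the instruction embedding.

    pixel_features: (H, W, D) per-pixel features from the vision encoder.
    text_embedding: (D,) embedding of an instruction such as
                    "find a person wearing black".
    """
    feats = pixel_features / (np.linalg.norm(pixel_features, axis=-1, keepdims=True) + 1e-8)
    text = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    return feats @ text  # (H, W) similarity map

# Example with random stand-in features (image 120x160, feature dim 64).
sim_map = pixelwise_similarity(np.random.rand(120, 160, 64), np.random.rand(64))
print(sim_map.shape, float(sim_map.max()))
```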
📝 Abstract
Vision-language navigation in unknown environments is crucial for mobile robots. In scenarios such as household assistance and rescue, mobile robots need to understand a human command such as "find a person wearing black". We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, we introduce heuristic vision-language (HVL) spatial reasoning for goal point selection. It combines pixel-wise vision-language features and heuristic exploration to enable robust and efficient navigation to human-instructed instances in various environments. We deploy VL-Nav on a four-wheel mobile robot and conduct comprehensive navigation tasks in various environments of different scales and semantic complexities, both indoors and outdoors. Remarkably, VL-Nav operates at a real-time frequency of 30 Hz on a Jetson Orin NX, highlighting its ability to conduct efficient vision-language navigation. Experimental results show that VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
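To make the goal point selection concrete, here is a minimal sketch that combines a per-pixel vision-language similarity map (such as the one computed above) with a simple distance-based exploration heuristic to rank candidate goal points. The candidate format, the weights `alpha` and `beta`, and the scoring formula are illustrative assumptions, not the actual HVL formulation.

```python
# Minimal sketch of HVL-style goal-point scoring: semantic cue from a
# pixel-wise vision-language similarity map plus a distance-based
# exploration heuristic. Weights and formula are assumptions for illustration.
import numpy as np

def select_goal_point(vl_similarity, candidates, robot_xy, alpha=1.0, beta=0.5):
    """Rank candidate goal points and return the best one.

    vl_similarity: (H, W) array of per-pixel similarity to the instruction.
    candidates: list of (row, col, world_x, world_y) candidate goal points.
    robot_xy: (x, y) current robot position in world coordinates.
    """
    scores = []
    for row, col, wx, wy in candidates:
        semantic = vl_similarity[row, col]                   # cross-modal cue
        dist = np.hypot(wx - robot_xy[0], wy - robot_xy[1])  # travel cost
        heuristic = 1.0 / (1.0 + dist)                       # prefer nearby frontiers
        scores.append(alpha * semantic + beta * heuristic)
    return candidates[int(np.argmax(scores))]

# Example: pick the best goal among three hypothetical frontier candidates.
sim_map = np.random.rand(120, 160)
cands = [(30, 40, 2.0, 0.5), (60, 100, 4.5, -1.0), (100, 150, 1.2, 3.0)]
goal = select_goal_point(sim_map, cands, robot_xy=(0.0, 0.0))
print("selected goal point:", goal)
```

In this sketch, a higher `alpha` favors candidates that look semantically like the instructed target, while a higher `beta` favors easily reachable frontier points; the actual system presumably balances these cues with its own exploration-guided formulation.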