🤖 AI Summary
To address the challenge of real-time language instruction understanding and precise target localization for mobile robots in unknown environments, this paper proposes a vision-language navigation method tailored to such settings. The core innovation is a heuristic vision-language (HVL) spatial reasoning mechanism that jointly performs pixel-level cross-modal feature alignment and exploration-guided heuristic path planning, overcoming the limitations of single-frame image matching. The system deploys a lightweight vision-language encoder and a real-time inference framework on the Jetson Orin NX platform. Evaluated across indoor and outdoor multi-scale complex scenes, it achieves an 86.3% task success rate, 44.15% higher than previous methods, while sustaining a 30 Hz inference frequency. This significantly improves both the real-time performance and robustness of semantic navigation in open, unstructured environments.
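As a rough illustration of the pixel-level cross-modal feature alignment described above, the sketch below computes a per-pixel similarity map between image features and an instruction embedding. The encoder interface, feature dimensions, and variable names are assumptions for illustration only, not the paper's actual model.

```python
# Minimal sketch of pixel-level cross-modal feature alignment, assuming a
# hypothetical lightweight vision-language encoder that yields a per-pixel
# image feature map and a text embedding in a shared space.
import numpy as np

def pixelwise_similarity(pixel_features, text_embedding):
    """Return an (H, W) map of cosine similarity between each pixel feature
    and the instruction embedding.

    pixel_features: (H, W, D) per-pixel features from the vision encoder.
    text_embedding: (D,) embedding of an instruction such as
                    "find a person wearing black".
    """
    feats = pixel_features / (np.linalg.norm(pixel_features, axis=-1, keepdims=True) + 1e-8)
    text = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    return feats @ text  # (H, W) similarity map

# Example with random stand-in features (image 120x160, feature dim 64).
sim_map = pixelwise_similarity(np.random.rand(120, 160, 64), np.random.rand(64))
print(sim_map.shape, float(sim_map.max()))
```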
📝 Abstract
Vision-language navigation in unknown environments is crucial for mobile robots. In scenarios such as household assistance and rescue, mobile robots need to understand a human command such as "find a person wearing black". We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, we introduce heuristic vision-language (HVL) spatial reasoning for goal point selection. It combines pixel-wise vision-language features and heuristic exploration to enable robust and efficient navigation to human-instructed instances in various environments. We deploy VL-Nav on a four-wheel mobile robot and conduct comprehensive navigation tasks in various environments of different scales and semantic complexities, both indoors and outdoors. Remarkably, VL-Nav operates at a real-time frequency of 30 Hz on a Jetson Orin NX, highlighting its ability to conduct efficient vision-language navigation. Experimental results show that VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
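To make the goal point selection concrete, here is a minimal sketch that combines a per-pixel vision-language similarity map (such as the one computed above) with a simple distance-based exploration heuristic to rank candidate goal points. The candidate format, the weights `alpha` and `beta`, and the scoring formula are illustrative assumptions, not the actual HVL formulation.

```python
# Minimal sketch of HVL-style goal-point scoring: semantic cue from a
# pixel-wise vision-language similarity map plus a distance-based
# exploration heuristic. Weights and formula are assumptions for illustration.
import numpy as np

def select_goal_point(vl_similarity, candidates, robot_xy, alpha=1.0, beta=0.5):
    """Rank candidate goal points and return the best one.

    vl_similarity: (H, W) array of per-pixel similarity to the instruction.
    candidates: list of (row, col, world_x, world_y) candidate goal points.
    robot_xy: (x, y) current robot position in world coordinates.
    """
    scores = []
    for row, col, wx, wy in candidates:
        semantic = vl_similarity[row, col]                   # cross-modal cue
        dist = np.hypot(wx - robot_xy[0], wy - robot_xy[1])  # travel cost
        heuristic = 1.0 / (1.0 + dist)                       # prefer nearby frontiers
        scores.append(alpha * semantic + beta * heuristic)
    return candidates[int(np.argmax(scores))]

# Example: pick the best goal among three hypothetical frontier candidates.
sim_map = np.random.rand(120, 160)
cands = [(30, 40, 2.0, 0.5), (60, 100, 4.5, -1.0), (100, 150, 1.2, 3.0)]
goal = select_goal_point(sim_map, cands, robot_xy=(0.0, 0.0))
print("selected goal point:", goal)
```

In this sketch, a higher `alpha` favors candidates that look semantically like the instructed target, while a higher `beta` favors easily reachable frontier points; the actual system presumably balances these cues with its own exploration-guided formulation.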