VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning

📅 2025-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of real-time language instruction understanding and precise target localization by mobile robots in unknown environments, this paper proposes a vision-language navigation method tailored for unknown settings. The core innovation is a heuristic vision-language (HVL) spatial reasoning mechanism, which jointly performs pixel-level cross-modal feature alignment and exploration-guided heuristic path planning—thereby overcoming the limitations of single-frame image matching. The system deploys a lightweight vision-language encoder and a real-time inference framework on the Jetson Orin NX platform. Evaluated across indoor and outdoor multi-scale complex scenes, it achieves an 86.3% task success rate—44.15% higher than the state-of-the-art—and sustains a 30 Hz inference frequency. This significantly enhances both the real-time performance and robustness of semantic navigation in open, unstructured environments.
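The goal-point selection idea described above — fusing per-pixel vision-language similarity with an exploration heuristic — can be sketched as a simple weighted score over candidate frontier points. This is an illustrative reconstruction, not the paper's implementation: the function name, weights, and the inverse-distance exploration term are assumptions.

```python
import numpy as np

def hvl_goal_score(vl_similarity, frontier_points, robot_pos,
                   w_semantic=0.7, w_explore=0.3):
    """Pick a goal point by combining pixel-wise vision-language
    similarity with a heuristic exploration bonus.

    vl_similarity   : HxW array of per-pixel similarity to the
                      language instruction (e.g., CLIP-style scores in [0, 1])
    frontier_points : list of (u, v, x, y) tuples — pixel coords plus
                      world coords of each candidate frontier point
    robot_pos       : (x, y) robot position in world coordinates
    Weights and the distance heuristic are illustrative choices.
    """
    scores = []
    for (u, v, x, y) in frontier_points:
        semantic = vl_similarity[v, u]               # how well this pixel matches the instruction
        dist = np.hypot(x - robot_pos[0], y - robot_pos[1])
        explore = 1.0 / (1.0 + dist)                 # prefer nearer unexplored frontiers
        scores.append(w_semantic * semantic + w_explore * explore)
    return int(np.argmax(scores))                    # index of the best candidate
```

The key point, per the summary, is that selection is not a single image-level match: semantic evidence is read at the pixel level and traded off against an exploration heuristic, so the robot can still make progress when the target is not yet visible.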

📝 Abstract
Vision-language navigation in unknown environments is crucial for mobile robots. In scenarios such as household assistance and rescue, mobile robots need to understand a human command, such as "find a person wearing black". We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, we introduce heuristic-vision-language (HVL) spatial reasoning for goal point selection. It combines pixel-wise vision-language features and heuristic exploration to enable robust, efficient navigation to human-instructed instances in various environments. We deploy VL-Nav on a four-wheel mobile robot and conduct comprehensive navigation tasks in environments of different scales and semantic complexities, both indoors and outdoors. Remarkably, VL-Nav operates at a real-time frequency of 30 Hz on a Jetson Orin NX, highlighting its ability to conduct efficient vision-language navigation. Experimental results show that VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
Problem

Research questions and friction points this paper is trying to address.

Mobile Robotics
Natural Language Understanding
Object Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual and Language Navigation
Spatial Reasoning
Efficient Search Algorithm
👥 Authors

Yi Du
Chinese Academy of Sciences
Data Mining, Knowledge Engineering, AI for Science

Taimeng Fu
University at Buffalo
SLAM, Navigation, Neuro-Symbolic Learning

Zhuoqun Chen
Duke University
Robotics, Reinforcement Learning

Bowen Li
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Shaoshu Su
PhD Student, University at Buffalo, SUNY
SLAM, Machine Learning, MPC, Multi-Agent Systems

Zhipeng Zhao
Spatial AI & Robotics Lab, University at Buffalo, Buffalo, NY 14260, USA

Chen Wang
Spatial AI & Robotics Lab, University at Buffalo, Buffalo, NY 14260, USA