GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot vision-and-language navigation (VLN) methods are designed for discrete environments or rely on unsupervised training in continuous simulations, limiting generalization to real-world continuous settings. This paper proposes a training-free zero-shot VLN framework: natural language instructions are parsed into directed acyclic graphs of explicit spatial constraints, and a comprehensive constraint library covering canonical spatial relations enables path reasoning in continuous environments via constraint solving. To handle ambiguous or infeasible instructions, the authors introduce a navigation tree with backtracking, significantly improving robustness. To their knowledge, this is the first work to model instructions as structured graph-based constraints for zero-shot navigation. The method achieves state-of-the-art success rates on standard benchmarks and demonstrates strong cross-environment and cross-instruction generalization, validated through real-robot experiments.

📝 Abstract
In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. This constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationships mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph with waypoint nodes, object nodes, and edges, which are used as queries to retrieve from the library and build the graph constraints. The graph constraint optimization is solved by a constraint solver to determine the positions of the waypoints, yielding the robot's navigation path and final goal. To handle cases with no solution or multiple solutions, we construct a navigation tree with a backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency over state-of-the-art zero-shot VLN methods. We further conduct real-world experiments showing that our framework effectively generalizes to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.
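The pipeline the abstract describes (instruction → constraint graph → solver → waypoint positions) can be illustrated with a minimal sketch. Everything here is hypothetical: the relation names, the tiny predicate-based constraint library, and the brute-force grid solver are stand-ins for the paper's full constraint library and solver, used only to make the data flow concrete.

```python
import itertools

# Hypothetical spatial-constraint library: each entry maps a relation name
# (as it might appear in a parsed instruction) to a predicate over 2-D points.
CONSTRAINT_LIBRARY = {
    "left_of":     lambda p, q: p[0] < q[0],
    "right_of":    lambda p, q: p[0] > q[0],
    "in_front_of": lambda p, q: p[1] < q[1],
    "behind":      lambda p, q: p[1] > q[1],
    "near":        lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1]) <= 2,
}

def solve_graph(waypoints, objects, edges, grid_size=5):
    """Brute-force constraint solving on a small grid.

    waypoints: names of waypoint nodes whose positions are unknown.
    objects:   dict mapping detected object names to known (x, y) positions.
    edges:     (node_a, relation, node_b) triples from the instruction DAG,
               where nodes are waypoint or object names.
    Returns every assignment of grid cells to waypoints satisfying all
    constraints -- multiple solutions (or none) are possible.
    """
    cells = [(x, y) for x in range(grid_size) for y in range(grid_size)]
    solutions = []
    for combo in itertools.product(cells, repeat=len(waypoints)):
        pos = dict(objects)
        pos.update(zip(waypoints, combo))
        if all(CONSTRAINT_LIBRARY[rel](pos[a], pos[b]) for a, rel, b in edges):
            solutions.append(dict(zip(waypoints, combo)))
    return solutions

# "Go to the spot left of the chair and near the table."
objects = {"chair": (3, 2), "table": (1, 2)}
edges = [("w1", "left_of", "chair"), ("w1", "near", "table")]
print(solve_graph(["w1"], objects, edges))
```

A real system would replace the grid search with a proper constraint solver and the toy predicates with the full library, but the query structure (unknown waypoints constrained against observed objects) is the same.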
Problem

Research questions and friction points this paper is trying to address.

Training-free vision-language navigation in continuous environments
Decomposing instructions into spatial graph constraints
Zero-shot adaptation to unseen environments via constraint solving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework for VLN
Instruction as graph constraint optimization
Constraint solver for navigation path
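The abstract also mentions a navigation tree with backtracking for the no-solution / multiple-solution cases. A rough sketch of the branching idea, with a hypothetical `is_reachable` callback standing in for online feasibility checking during execution:

```python
def navigate_with_backtracking(solutions, is_reachable):
    """Sketch of the navigation tree: each candidate constraint solution
    becomes a branch. If a branch fails during execution (the waypoint
    turns out to be unreachable), the robot backtracks to the next branch.

    solutions:    candidate waypoint assignments from the constraint solver.
    is_reachable: hypothetical predicate standing in for online checks.
    Returns (committed candidate or None, branches explored).
    """
    stack = list(reversed(solutions))  # pop() yields candidates in order
    explored = []
    while stack:
        candidate = stack.pop()
        explored.append(candidate)
        if is_reachable(candidate):
            return candidate, explored   # commit to this branch
    return None, explored                # no branch works: infeasible
```

With multiple solver solutions this tries each in turn; with zero solutions it immediately reports infeasibility, which is where the paper's robustness gains over single-shot grounding would come from.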
Hang Yin
Department of Automation, Tsinghua University; Beijing Key Laboratory of Embodied Intelligence Systems; Beijing National Research Center for Information Science and Technology
Haoyu Wei
Department of Automation, Tsinghua University; Beijing Key Laboratory of Embodied Intelligence Systems; Beijing National Research Center for Information Science and Technology
Xiuwei Xu
Tsinghua University
computer vision · embodied AI
Wenxuan Guo
Department of Automation, Tsinghua University; Beijing Key Laboratory of Embodied Intelligence Systems; Beijing National Research Center for Information Science and Technology
Jie Zhou
Department of Automation, Tsinghua University; Beijing Key Laboratory of Embodied Intelligence Systems; Beijing National Research Center for Information Science and Technology
Jiwen Lu
Department of Automation, Tsinghua University; Beijing Key Laboratory of Embodied Intelligence Systems; Beijing National Research Center for Information Science and Technology