🤖 AI Summary
Existing robot navigation methods generalize poorly to unknown environments, relying heavily on exhaustive exploration or costly parameter fine-tuning.
Method: The paper proposes VLN-Zero, a two-phase zero-shot neurosymbolic navigation framework. In the exploration phase, structured prompting guides a vision-language model (VLM) to efficiently explore the environment and construct a symbolic scene graph. In the deployment phase, a neurosymbolic planner reasons over the scene graph, with a cache-reuse mechanism that accelerates decision-making (see the sketch below).
Contribution/Results: To the authors' knowledge, this is the first zero-shot, end-to-end vision-language navigation approach with low computational overhead: it requires neither environmental priors nor parameter adaptation. Experiments across diverse unknown environments demonstrate a 2x higher success rate than prior zero-shot methods, a 50% reduction in traversal time, and a 55% decrease in VLM invocations, outperforming most fine-tuned baselines.
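A minimal sketch of the two-phase control flow described above. All names here (`SceneGraph`, `explore`, `navigate`, and the `env`/`vlm`/`planner` interfaces) are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Compact symbolic map: nodes are semantically labeled places, edges are traversable links."""
    nodes: dict = field(default_factory=dict)   # node_id -> semantic label
    edges: dict = field(default_factory=dict)   # node_id -> set of reachable node_ids

    def add_observation(self, node_id, label, neighbors):
        self.nodes[node_id] = label
        self.edges.setdefault(node_id, set()).update(neighbors)

def explore(env, vlm, budget):
    """Exploration phase: structured prompts steer the VLM toward informative, diverse trajectories."""
    graph, state = SceneGraph(), env.reset()
    for _ in range(budget):
        prompt = (f"Observation: {state.description}. "
                  f"Visited: {list(graph.nodes)}. "
                  "Choose the direction most likely to reveal unexplored structure.")
        state = env.step(vlm.query(prompt))     # one VLM call per exploration step
        graph.add_observation(state.node_id, state.label, state.visible_neighbors)
    return graph

def navigate(task, start, graph, planner):
    """Deployment phase: the neurosymbolic planner searches the scene graph symbolically."""
    return planner.solve(graph, start=start, goal=task)   # no VLM call at plan time
```

Because planning happens over the symbolic graph rather than raw observations, deployment-time VLM usage can stay low, which is consistent with the reported reduction in VLM invocations.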
📝 Abstract
Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves a 2x higher success rate than state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/.
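To make the cache-enabled execution module concrete, here is a minimal sketch of reusing previously computed task-location trajectories. `TrajectoryCache`, `get_or_plan`, and the hit/miss counters are assumptions for illustration, not the released implementation:

```python
from typing import Callable, Dict, List, Tuple

Trajectory = List[str]   # assumed: a trajectory is a sequence of low-level actions
Key = Tuple[str, str]    # assumed: (task description, start location in the scene graph)

class TrajectoryCache:
    """Reuse previously computed task-location trajectories; fall back to the planner on a miss."""
    def __init__(self) -> None:
        self._store: Dict[Key, Trajectory] = {}
        self.hits, self.misses = 0, 0

    def get_or_plan(self, task: str, location: str,
                    plan_fn: Callable[[str, str], Trajectory]) -> Trajectory:
        key = (task, location)
        if key in self._store:
            self.hits += 1   # served from cache: no VLM or planner invocation
        else:
            self.misses += 1
            self._store[key] = plan_fn(task, location)
        return self._store[key]

# Illustrative usage: the second identical request is answered without replanning.
cache = TrajectoryCache()
cache.get_or_plan("go to the kitchen", "node_3", lambda t, s: ["forward", "left", "forward"])
cache.get_or_plan("go to the kitchen", "node_3", lambda t, s: ["forward", "left", "forward"])
assert cache.hits == 1 and cache.misses == 1
```

Keying on the (task, location) pair means repeated tasks in an already-explored environment bypass both the planner and the VLM entirely, which is one plausible way the framework halves traversal time on revisits.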