🤖 AI Summary
This work addresses the high latency of existing large language model (LLM)-based zero-shot open-vocabulary object navigation methods, whose frequent LLM queries at inference time hinder real-time deployment. The authors propose an efficient navigation framework that uses no large language or vision-language models. By reinterpreting ray frontiers, originally an exploration-bias mechanism, as direction-conditioned semantic goal representations, and by sparsely storing language-aligned features at frontiers for embedding-based scoring and tracking, the method provides the first explicit semantic interpretation of ray frontiers. A lightweight R2F-VLN module parses free-form natural language instructions, and the whole system relies only on classical mapping and planning pipelines. Evaluated both in Habitat-sim simulation and on a real robotic platform, the approach matches state-of-the-art zero-shot performance while running up to 6× faster at inference, meeting real-time requirements.
📝 Abstract
Zero-shot open-vocabulary object navigation has progressed rapidly with the emergence of large Vision-Language Models (VLMs) and Large Language Models (LLMs), now widely used as high-level decision-makers instead of end-to-end policies. Although effective, such systems often rely on iterative large-model queries at inference time, introducing latency and computational overhead that limit real-time deployment. To address this problem, we repurpose ray frontiers (R2F), a recently proposed frontier-based exploration paradigm, to develop an LLM-free framework for indoor open-vocabulary object navigation. While ray frontiers were originally used to bias exploration using semantic cues carried along rays, we reinterpret frontier regions as explicit, direction-conditioned semantic hypotheses that serve as navigation goals. Language-aligned features accumulated along out-of-range rays are stored sparsely at frontiers, where each region maintains multiple directional embeddings encoding plausible unseen content. Navigation thus reduces to embedding-based frontier scoring and goal tracking within a classical mapping and planning pipeline, eliminating iterative large-model reasoning. We further introduce R2F-VLN, a lightweight extension for free-form language instructions using syntactic parsing and relational verification without additional VLM or LLM components. Experiments in Habitat-sim and on a real robotic platform demonstrate zero-shot performance competitive with the state of the art under real-time execution, with up to 6× faster runtime than VLM-based alternatives.
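To make the "embedding-based frontier scoring" step concrete, the sketch below shows one plausible reading of it: each frontier region holds several directional language-aligned embeddings (e.g. CLIP-like vectors), the goal object name is embedded once, and the navigation target is simply the frontier whose best directional embedding has the highest cosine similarity to the goal. All names and shapes here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def score_frontiers(frontier_embeddings, goal_embedding):
    """Score each frontier by the best cosine similarity between the goal
    embedding and any of that frontier's directional embeddings.

    frontier_embeddings: list of (k_i, d) arrays, one per frontier region,
        holding k_i direction-conditioned embeddings of dimension d
        (hypothetical layout, assumed for illustration).
    goal_embedding: (d,) language embedding of the target object.
    """
    goal = goal_embedding / np.linalg.norm(goal_embedding)
    scores = []
    for dirs in frontier_embeddings:
        # Normalize each directional embedding, then take the best match.
        norms = np.linalg.norm(dirs, axis=1, keepdims=True)
        scores.append(float(np.max((dirs / norms) @ goal)))
    return scores

def select_goal_frontier(frontier_embeddings, goal_embedding):
    """Pick the frontier to navigate toward: the one with the top score."""
    scores = score_frontiers(frontier_embeddings, goal_embedding)
    return int(np.argmax(scores)), scores

# Toy usage with 3-d embeddings: frontier 1 contains a direction whose
# embedding points almost exactly at the goal, so it is selected.
goal = np.array([1.0, 0.0, 0.0])
f0 = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])   # orthogonal to goal
f1 = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])   # one aligned direction
best, scores = select_goal_frontier([f0, f1], goal)
print(best)  # 1
```

A classical planner would then treat the selected frontier's location as the current goal, re-scoring frontiers as the map grows; no iterative large-model query is needed in the loop.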