🤖 AI Summary
In embodied AI, vision-language models (VLMs) suffer from inefficient backtracking in object navigation due to coarse-grained environmental representations and blind query strategies. To address this, we propose a multi-level structured environment representation comprising viewpoint, object, and room nodes, coupled with a two-stage VLM-collaborative navigation framework: an upper stage performs semantic planning via graph-structured hierarchical reinforcement learning, while a lower stage enables fine-grained exploration through VLM-conditioned reasoning and real-time incremental mapping. This design decouples high-level planning from low-level execution while keeping the two coordinated, substantially mitigating insufficient environmental understanding and over-reliance on the VLM. Our method achieves state-of-the-art performance on three major simulation benchmarks (HM3D, RoboTHOR, and MP3D), improving success rate by 7.1% and navigation efficiency by 12.5%. Furthermore, it demonstrates strong robustness across 15 object-navigation tasks in 10 diverse real-world indoor environments on a physical robot.
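To make the two-stage design more concrete, the Python sketch below illustrates one plausible control loop: the upper stage queries the VLM sparsely over a room-level subgraph, while the lower stage performs VLM-assisted frontier exploration inside the chosen room. All names here (`graph`, `vlm.plan_room`, `vlm.rank_frontier`, etc.) are our own illustrative assumptions, not the paper's actual interfaces.

```python
# Illustrative sketch only: the graph/agent/vlm interfaces are assumptions
# for exposition, not the released STRIVE implementation.

def navigate(goal: str, agent, vlm, graph, max_steps: int = 500) -> bool:
    """Two-stage loop: sparse high-level room planning, dense low-level exploration."""
    target_room = None
    for _ in range(max_steps):
        obs = agent.observe()                      # RGB-D frame + pose
        graph.update(obs)                          # incremental multi-layer mapping

        if graph.has_object(goal):                 # goal already localized on the graph
            agent.move_to(graph.object_position(goal))
            return True

        # Upper stage: query the VLM over the room-level subgraph only when the
        # current room is exhausted, instead of querying at every step.
        if target_room is None or graph.room_explored(target_room):
            target_room = vlm.plan_room(graph.room_subgraph(), goal)

        # Lower stage: VLM-assisted frontier selection inside the chosen room.
        frontier = graph.best_frontier(room=target_room)
        waypoint = vlm.rank_frontier(obs, frontier, goal)
        agent.step_toward(waypoint)
    return False                                   # goal not found within budget
```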
📝 Abstract
Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation poses two key challenges: effectively representing complex environment information and determining when and how to query VLMs. Insufficient environment understanding and over-reliance on VLMs (e.g., querying at every step) can lead to unnecessary backtracking and reduced navigation efficiency, especially in continuous environments. To address these challenges, we propose a novel framework that constructs a multi-layer representation of the environment during navigation. This representation consists of viewpoint nodes, object nodes, and room nodes. Viewpoint and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this representation, we propose a novel two-stage navigation policy, integrating high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently locate a goal object. We evaluate our approach on three simulated benchmarks (HM3D, RoboTHOR, and MP3D) and achieve state-of-the-art performance on both success rate (↑7.1%) and navigation efficiency (↑12.5%). We further validate our method on a real robot platform, demonstrating strong robustness across 15 object navigation tasks in 10 different indoor environments. The project page is available at https://zwandering.github.io/STRIVE.github.io/.
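As a minimal illustration of the three node types named in the abstract, the hypothetical Python sketch below models the multi-layer representation; the class and field names are our own assumptions and do not mirror the released code.

```python
from dataclasses import dataclass, field

# Minimal illustrative sketch of the multi-layer environment graph.
# Class and field names are assumptions, not the actual STRIVE data structures.

@dataclass
class ViewpointNode:   # a reachable location, used for intra-room exploration
    position: tuple[float, float, float]
    visited: bool = False

@dataclass
class ObjectNode:      # a detected object instance, used for target localization
    label: str
    position: tuple[float, float, float]
    confidence: float = 0.0

@dataclass
class RoomNode:        # a room region, used for inter-room planning
    name: str
    viewpoints: list[ViewpointNode] = field(default_factory=list)
    objects: list[ObjectNode] = field(default_factory=list)
    neighbors: list["RoomNode"] = field(default_factory=list)  # room connectivity

# Example: a kitchen room node with one viewpoint and one detected object.
kitchen = RoomNode(name="kitchen")
kitchen.viewpoints.append(ViewpointNode(position=(1.0, 0.0, 2.5)))
kitchen.objects.append(ObjectNode(label="refrigerator", position=(1.4, 0.0, 3.1), confidence=0.9))
```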