🤖 AI Summary
In embodied AI, vision-language models (VLMs) suffer from inefficient backtracking in object navigation due to coarse-grained environmental representations and blind query strategies. To address this, we propose a multi-level structured environment representation comprising viewpoint, object, and room nodes, coupled with a two-stage VLM-collaborative navigation framework: an upper stage performs semantic planning via graph-structured hierarchical reinforcement learning, while a lower stage enables fine-grained exploration through VLM-conditioned reasoning and real-time incremental mapping. This design decouples high-level planning from low-level execution while keeping the two coordinated, substantially mitigating insufficient environmental understanding and over-reliance on the VLM. Our method achieves state-of-the-art performance on three major simulation benchmarks (HM3D, RoboTHOR, and MP3D), improving success rate by 7.1% and navigation efficiency by 12.5%. Furthermore, it demonstrates strong robustness across 15 object-navigation tasks in 10 diverse real-world indoor environments on a physical robot.
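To make the two-stage design more concrete, the Python sketch below illustrates one plausible control loop: the upper stage queries the VLM sparsely over a room-level subgraph, while the lower stage performs VLM-assisted frontier exploration inside the chosen room. All names here (`graph`, `vlm.plan_room`, `vlm.rank_frontier`, etc.) are our own illustrative assumptions, not the paper's actual interfaces.

```python
# Illustrative sketch only: the graph/agent/vlm interfaces are assumptions
# for exposition, not the released STRIVE implementation.

def navigate(goal: str, agent, vlm, graph, max_steps: int = 500) -> bool:
    """Two-stage loop: sparse high-level room planning, dense low-level exploration."""
    target_room = None
    for _ in range(max_steps):
        obs = agent.observe()                      # RGB-D frame + pose
        graph.update(obs)                          # incremental multi-layer mapping

        if graph.has_object(goal):                 # goal already localized on the graph
            agent.move_to(graph.object_position(goal))
            return True

        # Upper stage: query the VLM over the room-level subgraph only when the
        # current room is exhausted, instead of querying at every step.
        if target_room is None or graph.room_explored(target_room):
            target_room = vlm.plan_room(graph.room_subgraph(), goal)

        # Lower stage: VLM-assisted frontier selection inside the chosen room.
        frontier = graph.best_frontier(room=target_room)
        waypoint = vlm.rank_frontier(obs, frontier, goal)
        agent.step_toward(waypoint)
    return False                                   # goal not found within budget
```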
📝 Abstract
Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation poses two key challenges: effectively representing complex environment information and determining when and how to query VLMs. Insufficient environment understanding and over-reliance on VLMs (e.g., querying at every step) can lead to unnecessary backtracking and reduced navigation efficiency, especially in continuous environments. To address these challenges, we propose a novel framework that constructs a multi-layer representation of the environment during navigation. This representation consists of viewpoint nodes, object nodes, and room nodes. Viewpoint and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this representation, we propose a novel two-stage navigation policy, integrating high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently locate a goal object. We evaluate our approach on three simulated benchmarks (HM3D, RoboTHOR, and MP3D) and achieve state-of-the-art performance on both success rate (↑7.1%) and navigation efficiency (↑12.5%). We further validate our method on a real robot platform, demonstrating strong robustness across 15 object navigation tasks in 10 different indoor environments. The project page is available at https://zwandering.github.io/STRIVE.github.io/.
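As a minimal illustration of the three node types named in the abstract, the hypothetical Python sketch below models the multi-layer representation; the class and field names are our own assumptions and do not mirror the released code.

```python
from dataclasses import dataclass, field

# Minimal illustrative sketch of the multi-layer environment graph.
# Class and field names are assumptions, not the actual STRIVE data structures.

@dataclass
class ViewpointNode:   # a reachable location, used for intra-room exploration
    position: tuple[float, float, float]
    visited: bool = False

@dataclass
class ObjectNode:      # a detected object instance, used for target localization
    label: str
    position: tuple[float, float, float]
    confidence: float = 0.0

@dataclass
class RoomNode:        # a room region, used for inter-room planning
    name: str
    viewpoints: list[ViewpointNode] = field(default_factory=list)
    objects: list[ObjectNode] = field(default_factory=list)
    neighbors: list["RoomNode"] = field(default_factory=list)  # room connectivity

# Example: a kitchen room node with one viewpoint and one detected object.
kitchen = RoomNode(name="kitchen")
kitchen.viewpoints.append(ViewpointNode(position=(1.0, 0.0, 2.5)))
kitchen.objects.append(ObjectNode(label="refrigerator", position=(1.4, 0.0, 3.1), confidence=0.9))
```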