WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

📅 2025-03-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To reduce the risky and costly environment interactions involved in zero-shot object navigation in unknown environments, this paper proposes WMNav, a world model framework powered by Vision-Language Models (VLMs). The method has four parts: (1) a modular architecture that couples a VLM with a world model to predict how the environment state will evolve; (2) an online-updated Curiosity Value Map that serves as dynamic spatial memory; (3) a two-stage action policy, with broad-area exploration followed by precise localization; and (4) hallucination mitigation through a decomposition modeled on a human-like thinking process and through feedback on the discrepancy between the world model's predictions and actual observations. In zero-shot evaluation on HM3D and MP3D, the method outperforms prior state-of-the-art approaches, with absolute gains of +3.2% in Success Rate (SR) and +3.2% in SPL on HM3D, and +13.5% SR and +1.1% SPL on MP3D.

📝 Abstract

Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent Vision-Language Model (VLM)-based agents have demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by VLMs. It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav maintains an online Curiosity Value Map as part of the world model memory, which provides dynamic configuration for the navigation policy. By decomposing the task according to a human-like thinking process, WMNav alleviates the impact of model hallucination: decisions are based on the feedback difference between the world model's plan and the actual observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D shows that WMNav surpasses existing zero-shot baselines in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.

Problem

Research questions and friction points this paper is trying to address.

Develops a navigation framework for locating objects in unseen environments.

Reduces risky interactions by predicting future world states using VLMs.

Improves navigation efficiency and success rates in zero-shot benchmarks.
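
The two reported metrics can be made concrete with a small sketch. It assumes the standard definitions used in embodied navigation: SR is the fraction of successful episodes, and SPL (Success weighted by Path Length) weights each success by the ratio of the shortest path length to the path the agent actually took. The `Episode` class and its field names are illustrative and do not come from the paper or its code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical container, names are illustrative)."""
    success: bool          # did the agent stop at the goal object?
    shortest_path: float   # shortest start-to-goal path length
    agent_path: float      # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Mean over episodes of success * shortest / max(agent, shortest)."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        longest = max(e.agent_path, e.shortest_path)
        # Assumption: a zero-length episode (agent starts at the goal)
        # counts as fully efficient.
        total += e.shortest_path / longest if longest > 0 else 1.0
    return total / len(episodes)


if __name__ == "__main__":
    demo = [
        Episode(success=True, shortest_path=5.0, agent_path=10.0),
        Episode(success=False, shortest_path=4.0, agent_path=8.0),
        Episode(success=True, shortest_path=3.0, agent_path=3.0),
    ]
    print(f"SR  = {success_rate(demo):.3f}")
    print(f"SPL = {spl(demo):.3f}")
```

Because SPL penalizes detours, a method can raise SR without raising SPL by the same amount, which is consistent with the differing SR and SPL gains reported for MP3D.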

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision-Language Models into World Models

Uses an online-maintained Curiosity Value Map to configure the navigation policy dynamically

Implements two-stage action strategy for efficiency
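
As a rough illustration of how a curiosity map and a two-stage proposer might fit together, the sketch below keeps a grid of curiosity values, lowers them as regions are observed, and switches from exploration to localization once the goal is visible. This summary does not specify the map representation, the update rule, or the switching condition, so everything here is an assumption: the class and function names are hypothetical, the score passed to `update` stands in for a VLM-produced estimate, and merging by minimum is one plausible choice rather than the authors' confirmed method.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col) index into the grid


class CuriosityValueMap:
    """Hypothetical grid memory: high value = still worth exploring."""

    def __init__(self, rows: int, cols: int, initial: float = 10.0) -> None:
        self.values = [[initial] * cols for _ in range(rows)]

    def update(self, visible_cells: List[Cell], score: float) -> None:
        # Assumed rule: keep the lowest score ever assigned to a cell,
        # so a region judged unpromising does not regain curiosity.
        for row, col in visible_cells:
            self.values[row][col] = min(self.values[row][col], score)

    def mean(self, cells: List[Cell]) -> float:
        """Average curiosity over a set of cells (0.0 for an empty set)."""
        if not cells:
            return 0.0
        return sum(self.values[r][c] for r, c in cells) / len(cells)


def propose_action(
    cmap: CuriosityValueMap,
    candidate_views: Dict[str, List[Cell]],
    goal_visible: bool,
) -> Tuple[str, Optional[str]]:
    """Two-stage proposer: explore broadly, then localize precisely."""
    if goal_visible:
        # Stage 2: the goal has been seen, so stop exploring and approach it.
        return ("localize", None)
    if not candidate_views:
        raise ValueError("no candidate directions to explore")
    # Stage 1: head toward the direction with the highest mean curiosity.
    best = max(candidate_views, key=lambda name: cmap.mean(candidate_views[name]))
    return ("explore", best)


if __name__ == "__main__":
    cmap = CuriosityValueMap(rows=3, cols=3)
    cmap.update([(0, 0), (0, 1)], score=2.0)  # region already seen, low interest
    views = {"left": [(0, 0), (0, 1)], "right": [(2, 2)]}
    print(propose_action(cmap, views, goal_visible=False))
    print(propose_action(cmap, views, goal_visible=True))
```

Keeping the map separate from the proposer mirrors the modular design the summary describes: the memory can be updated after every observation while the policy only reads from it.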

🔎 Similar Papers

Navigation with VLM framework: Go to Any Language