ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses spatial hallucination, exploration deadlock, and semantic-control disconnection in zero-shot object navigation—challenges arising from the absence of prior maps and task-specific training—by proposing a hierarchical navigation framework. The approach integrates panoramic semantic priors with episodic memory, leveraging a vision-language model (VLM) for spatial reasoning and employing the Recognize Anything Model to anchor target regions. It introduces an adaptive dual-modality reflection mechanism based on an episodic semantic buffer queue, which validates target visibility against historical memory to refine decisions and generate executable action sequences via depth-aware masking. Evaluated on HM3D and MP3D benchmarks, the method substantially outperforms existing zero-shot approaches, achieving an 18.2% absolute gain in success rate (SR) and an 11.1% improvement in Success weighted by Path Length (SPL) on HM3D v0.2, along with 8.7% and 7.9% gains in SR and SPL, respectively, on MP3D.
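The summary above mentions "depth-aware masking" for turning high-level intent into executable actions. The paper does not give an implementation here, but the idea can be sketched as follows: partition the depth image into horizontal sectors and keep only headings whose near-ground pixels show enough free clearance. All names, sector counts, and thresholds below are illustrative assumptions, not the authors' parameters.

```python
import numpy as np

def feasible_headings(depth, n_sectors=9, min_clearance=1.0, min_ratio=0.8):
    """Illustrative depth-aware masking sketch (not the paper's code).

    depth: (H, W) array of metric depth in meters.
    Returns indices of horizontal sectors the agent could move toward.
    """
    h, w = depth.shape
    band = depth[h // 2:, :]          # lower half ~ ground-level obstacles
    sectors = np.array_split(band, n_sectors, axis=1)
    feasible = []
    for i, sec in enumerate(sectors):
        ratio = np.mean(sec > min_clearance)  # fraction of "open" pixels
        if ratio >= min_ratio:
            feasible.append(i)
    return feasible
```

A VLM could then be asked to pick among only these sector indices, so its choice always maps to a physically traversable movement.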
📝 Abstract
Zero-shot object navigation requires agents to locate unseen target objects in unfamiliar environments without prior maps or task-specific training, which remains a significant challenge. Although recent advances in vision-language models (VLMs) provide promising commonsense reasoning capabilities for this task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. To address this, we propose a novel hierarchical navigation framework named ReMemNav, which seamlessly integrates panoramic semantic priors and episodic memory with VLMs. We introduce the Recognize Anything Model to anchor the spatial reasoning process of the VLM, and we design an adaptive dual-modal rethinking mechanism based on an episodic semantic buffer queue. This mechanism actively verifies target visibility and corrects decisions using historical memory to prevent deadlocks. For low-level action execution, ReMemNav extracts a sequence of feasible actions using depth masks, allowing the VLM to select the optimal action and map it into actual spatial movement. Extensive evaluations on HM3D and MP3D demonstrate that ReMemNav outperforms existing training-free zero-shot baselines in both success rate and exploration efficiency. Specifically, it achieves significant absolute improvements, with SR and SPL increasing by 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.
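The abstract's "episodic semantic buffer queue" and dual-modal rethinking mechanism can be sketched at a high level: a fixed-length queue of per-step semantic observations that (a) triggers rethinking when recent observations stop changing (a proxy for a local exploration deadlock) and (b) validates VLM-claimed target visibility against memory to suppress one-off hallucinations. This is a minimal sketch under assumed names and thresholds, not the authors' implementation.

```python
from collections import deque

class EpisodicSemanticBuffer:
    """Illustrative sketch of an episodic semantic buffer queue
    (all class/method names and thresholds are assumptions)."""

    def __init__(self, capacity=8):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off

    def push(self, step, labels, target_visible):
        """Record the semantic labels seen at one step and whether
        the target object was (reportedly) visible."""
        self.buffer.append({
            "step": step,
            "labels": set(labels),
            "target_visible": target_visible,
        })

    def should_rethink(self, min_repeats=3):
        """Trigger rethinking when the last few observations are
        identical, i.e. exploration has stopped making progress."""
        if len(self.buffer) < min_repeats:
            return False
        recent = list(self.buffer)[-min_repeats:]
        first = recent[0]["labels"]
        return all(e["labels"] == first for e in recent)

    def target_confirmed(self, min_hits=2):
        """Require the target in several buffered frames before
        committing, filtering isolated spatial hallucinations."""
        hits = sum(e["target_visible"] for e in self.buffer)
        return hits >= min_hits
```

In use, the agent would call `should_rethink()` each step to decide whether to re-query the VLM, and gate any "target found" decision on `target_confirmed()`.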
Problem

Research questions and friction points this paper is trying to address.

zero-shot object navigation
spatial hallucinations
exploration deadlocks
semantic-action disconnect
unfamiliar environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot object navigation
vision-language models
episodic memory
spatial reasoning
rethinking mechanism