🤖 AI Summary
Existing vision-language navigation methods struggle to simultaneously achieve high success rates and strong generalization: supervised fine-tuning exhibits limited generalization, while training-free approaches suffer from low success rates. This work proposes a Memory-Execute-Review (MER) framework that, for the first time, integrates hierarchical memory, zero-shot reasoning, and anomaly-driven behavior correction into a unified, training-free paradigm for efficient zero-shot navigation. The method achieves an average success rate improvement of 7% over training-free baselines and 5% over zero-shot baselines across four datasets. Notably, it outperforms all supervised fine-tuning and training-free methods on challenging benchmarks such as HM3D_OVON, effectively overcoming the longstanding trade-off between performance and generalization in zero-shot navigation.
📝 Abstract
Vision-Language Navigation (VLN) is a fundamental capability for embodied intelligence and a critical open challenge. However, existing methods remain unsatisfactory in both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, and it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework consisting of three parts: a hierarchical memory module that supplies information to the other modules, an execute module for routine decision-making and actions, and a review module for detecting abnormal situations and correcting behavior. We validate the effectiveness of this framework on the Object Goal Navigation task. Across four datasets, our average SR achieves absolute improvements of 7% and 5% over all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most widely used HM3D_v0.1 benchmark and the more challenging open-vocabulary dataset HM3D_OVON, SR improves by 8% and 6% under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperforms all TF methods but also surpasses all SFT methods, leading in SR (by 5% and 2%, respectively) while retaining strong generalization.
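The abstract's three-module loop can be sketched as a simple control cycle. This is a minimal illustrative sketch, not the paper's implementation: the class and method names (`MERAgent`, `execute`, `review`) and the placeholder policy and anomaly rule are all assumptions for exposition.

```python
# Hedged sketch of a Memory-Execute-Review control loop.
# All names and policies here are illustrative, not from the paper.

class MERAgent:
    def __init__(self):
        # Hierarchical memory module: in the paper this stores structured
        # scene information; here it is just a list of past observations.
        self.memory = []

    def execute(self, observation):
        # Execute module: routine decision-making from memory + observation.
        self.memory.append(observation)
        return "move_forward"  # placeholder policy for illustration

    def review(self, action, outcome):
        # Review module: anomaly-driven behavior correction. Overrides the
        # proposed action when the observed outcome looks abnormal.
        if outcome == "collision":  # toy anomaly signal
            return "turn_left"
        return action


def navigate(agent, steps):
    """Run the Memory-Execute-Review cycle over (observation, outcome) pairs."""
    actions = []
    for observation, outcome in steps:
        proposed = agent.execute(observation)       # routine decision
        corrected = agent.review(proposed, outcome)  # anomaly correction
        actions.append(corrected)
    return actions
```

The key design point the abstract emphasizes is that the review step sits outside the routine policy, so corrections require no retraining, which is what keeps the framework training-free.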