ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a failure mode in zero-shot object navigation: after detecting a plausible target, agents often act inconsistently, oscillating between exploration and pursuit or abandoning the object near success. The authors propose ConsistNav, a training-free framework whose semantic executive coordinates a finite-state controller, persistent candidate memory, and stability-aware action control, so that semantic evidence is used consistently across frames and erratic decisions are suppressed. The executive changes neither the detector nor the low-level planner; it governs when semantic evidence should influence navigation. Evaluated on the HM3D and MP3D benchmarks, ConsistNav achieves state-of-the-art results among the compared zero-shot ObjectNav methods, improving success rate (SR) by 11.4% and success weighted by path length (SPL) by 7.9% over the controlled baseline on MP3D.
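For reference, the two reported metrics are usually defined as below. This assumes the paper follows the standard embodied-navigation definitions, which the summary does not state explicitly. Here N is the number of evaluation episodes, S_i is 1 if episode i succeeds and 0 otherwise, \ell_i is the shortest-path length from start to goal, and p_i is the length of the path the agent actually took.

```latex
\mathrm{SR} = \frac{1}{N} \sum_{i=1}^{N} S_i,
\qquad
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)}
```

Under these definitions SPL can never exceed SR, so an agent that wanders before reaching the target is penalized even when the episode counts as a success.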

📝 Abstract

Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image-text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.
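The three modules named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, phase names, thresholds (`commit_score`, `drop_score`, `stop_dist`, `min_hits`), the decay rule, and the action strings are all hypothetical choices made for illustration. The sketch only shows the general idea of accumulating detections across frames and gating phase changes behind guards, so that one missed frame does not cancel a pursuit.

```python
import math
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    EXPLORE = auto()
    PURSUE = auto()
    STOP = auto()


@dataclass
class Candidate:
    x: float
    y: float
    score: float = 0.0  # accumulated, decayed detection confidence
    hits: int = 0       # number of frames that supported this candidate


@dataclass
class CandidateMemory:
    """Hypothetical persistent memory: merges nearby detections over time."""
    merge_radius: float = 1.0  # assumed merge distance in map units
    decay: float = 0.9         # assumed per-step evidence decay
    candidates: list = field(default_factory=list)

    def update(self, detections):
        # detections: iterable of (x, y, confidence) in map coordinates
        for c in self.candidates:
            c.score *= self.decay
        for x, y, conf in detections:
            match = next((c for c in self.candidates
                          if math.hypot(c.x - x, c.y - y) <= self.merge_radius),
                         None)
            if match is None:
                match = Candidate(x, y)
                self.candidates.append(match)
            match.score += conf  # position is kept fixed for simplicity
            match.hits += 1

    def best(self):
        return max(self.candidates, key=lambda c: c.score, default=None)


@dataclass
class Executive:
    """Hypothetical finite-state executive with guarded transitions."""
    memory: CandidateMemory = field(default_factory=CandidateMemory)
    phase: Phase = Phase.EXPLORE
    commit_score: float = 1.5  # evidence needed to start pursuing
    drop_score: float = 0.3    # evidence below which pursuit is abandoned
    stop_dist: float = 1.0     # distance at which stopping is considered
    min_hits: int = 2          # frames required before a stop is trusted

    def step(self, detections, agent_xy):
        self.memory.update(detections)
        best = self.memory.best()
        score = best.score if best else 0.0
        if self.phase is Phase.EXPLORE:
            if score >= self.commit_score:
                self.phase = Phase.PURSUE
        elif self.phase is Phase.PURSUE:
            if score < self.drop_score:
                self.phase = Phase.EXPLORE
            else:
                dist = math.hypot(best.x - agent_xy[0], best.y - agent_xy[1])
                if dist <= self.stop_dist and best.hits >= self.min_hits:
                    self.phase = Phase.STOP
        return self.phase


def is_rotating_in_place(recent_actions, window=6):
    """Flag rotational stagnation: the last `window` actions are all turns."""
    tail = recent_actions[-window:]
    return len(tail) == window and all(
        a in ("turn_left", "turn_right") for a in tail)
```

The gap between `commit_score` and `drop_score` acts as hysteresis: the agent needs repeated evidence to commit to a candidate, but a few frames without a detection only decay the score rather than immediately sending the agent back to exploration. A caller would invoke `Executive.step` once per frame with the current detections and agent position, and could consult `is_rotating_in_place` to override the planner when the agent keeps turning without progress.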

Problem

Research questions and friction points this paper is trying to address.

zero-shot object navigation

action consistency gap

semantic evidence

object pursuit

navigation stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

action consistency gap

semantic executive control

zero-shot object navigation