🤖 AI Summary
This work addresses zero-shot object navigation in unknown multi-floor environments, where existing methods often struggle to balance exploration and exploitation, leading to local traps or inefficient wandering. To overcome this, it proposes AERR-Nav, a framework featuring an adaptive three-state mechanism (Exploration, Recovery, and Reminiscing) integrated with a dual-process cognitive architecture inspired by fast and slow thinking. The approach leverages semantic value for waypoint selection, enhances spatial memory through topological reasoning, and employs a multimodal large language model for high-level planning, enabling dynamic state transitions and coordinated strategy execution. Evaluated on the HM3D and MP3D benchmarks, AERR-Nav achieves state-of-the-art performance among zero-shot navigation methods, and ablation studies confirm the contribution of each component in complex multi-floor scenarios.
📝 Abstract
Zero-Shot Object Navigation (ZSON) in unknown multi-floor environments presents a significant challenge. Recent methods, mostly based on greedy waypoint selection by semantic value, spatial-topology-enhanced memory, and Multimodal Large Language Models (MLLMs) as decision-making frameworks, have led to improvements. However, these architectures struggle to balance exploration and exploitation in unseen environments, especially multi-floor ones: robots get stuck at narrow intersections, wander endlessly, or fail to find stair entrances. To overcome these challenges, we propose AERR-Nav, a Zero-Shot Object Navigation framework that dynamically adjusts its state based on the robot's environment. Specifically, AERR-Nav has two key advantages: (1) An Adaptive Exploration-Recovery-Reminiscing Strategy enables robots to transition dynamically between three states, facilitating specialized responses to diverse navigation scenarios. (2) An Adaptive Exploration State featuring Fast-Thinking and Slow-Thinking modes helps robots better balance exploration, exploitation, and higher-level reasoning based on evolving environmental information. Extensive experiments on the HM3D and MP3D benchmarks demonstrate that AERR-Nav achieves state-of-the-art performance among zero-shot methods. Comprehensive ablation studies further validate the efficacy of the proposed strategy and modules.
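
The abstract names three states and two thinking modes but does not specify when the robot switches between them. As a rough illustration only, the idea can be sketched as a small state selector. Everything below is an assumption made for the sketch, not the paper's method: the signal names (`is_stuck`, `frontier_exhausted`, `best_waypoint_value`), the transition rules, and the `confidence_threshold` value are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class NavState(Enum):
    EXPLORE = auto()    # "Exploration" state from the abstract
    RECOVER = auto()    # "Recovery" state from the abstract
    REMINISCE = auto()  # "Reminiscing" state from the abstract


class ThinkingMode(Enum):
    FAST = auto()  # e.g. greedy pick of the best-scoring waypoint
    SLOW = auto()  # e.g. defer to higher-level (MLLM) reasoning


@dataclass
class Signals:
    """Hypothetical per-step signals; none of these names come from the paper."""
    is_stuck: bool              # no progress, e.g. at a narrow intersection
    frontier_exhausted: bool    # nothing new left to explore nearby
    best_waypoint_value: float  # semantic value of the top candidate, in [0, 1]


def next_state(signals: Signals) -> NavState:
    """Pick a state from the current signals (assumed rules, for illustration)."""
    if signals.is_stuck:
        return NavState.RECOVER
    if signals.frontier_exhausted:
        # Assumed: revisit remembered places, such as a stair entrance seen earlier.
        return NavState.REMINISCE
    return NavState.EXPLORE


def thinking_mode(signals: Signals, confidence_threshold: float = 0.5) -> ThinkingMode:
    """Within exploration, choose fast or slow thinking (assumed threshold rule)."""
    if signals.best_waypoint_value >= confidence_threshold:
        return ThinkingMode.FAST
    return ThinkingMode.SLOW
```

In this sketch a stuck robot always enters recovery first, and slow thinking is used only when no candidate waypoint stands out. The actual transition criteria in AERR-Nav may differ.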