🤖 AI Summary
Visual Object Goal Navigation (ObjectNav) generalizes poorly in the zero-shot setting, to unseen environments and novel object categories, largely because end-to-end approaches lack structured reasoning. To address this, we propose CL-CoTNav, a Vision-Language Model (VLM)-driven closed-loop Hierarchical Chain-of-Thought (H-CoT) framework. It applies adaptive weighting based on detection and reasoning confidence scores, so that high-confidence data are prioritized and noisy inputs have less influence on decision-making; introduces a multi-turn question-answering dataset derived from human demonstration trajectories to support cognition-inspired perception-reasoning co-optimization; and combines H-CoT prompting, VLM fine-tuning, and training in the AI Habitat simulator. Experiments show that the method outperforms state-of-the-art approaches on zero-shot ObjectNav, improving Success Rate (SR) and Success weighted by Path Length (SPL) by 22.4%. We publicly release our dataset, models, and demonstration videos.
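For readers unfamiliar with the two reported metrics, the following minimal sketch shows how SR and SPL are commonly computed over a set of evaluation episodes. It follows the standard definition of SPL (per-episode success multiplied by the ratio of shortest-path length to the larger of the path taken and the shortest path, averaged over episodes). The function names and the example episode values are illustrative assumptions and are not taken from the paper.

```python
def success_rate(successes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        1 if the episode succeeded, else 0
    shortest_lengths: shortest-path distance from start to goal (> 0)
    path_lengths:     length of the path the agent actually took
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        # A failed episode contributes 0; a success contributes at most 1.
        total += s * l / max(p, l)
    return total / len(successes)


# Hypothetical episodes: one inefficient success, one failure, one optimal success.
episode_success = [1, 0, 1]
episode_shortest = [5.0, 4.0, 6.0]
episode_taken = [10.0, 8.0, 6.0]

sr_value = success_rate(episode_success)
spl_value = spl(episode_success, episode_shortest, episode_taken)
```

Because SPL discounts each success by path efficiency, it can never exceed SR on the same set of episodes.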
📝 Abstract
Visual Object Goal Navigation (ObjectNav) requires a robot to locate a target object in an unseen environment using egocentric observations. However, decision-making policies often struggle to transfer to unseen environments and novel target objects, which is the core generalization problem. Traditional end-to-end learning methods exacerbate this issue, as they rely on memorizing spatial patterns rather than employing structured reasoning, limiting their ability to generalize effectively. In this letter, we introduce Closed-Loop Hierarchical Chain-of-Thought Navigation (CL-CoTNav), a vision-language model (VLM)-driven ObjectNav framework that integrates structured reasoning and closed-loop feedback into navigation decision-making. To enhance generalization, we fine-tune a VLM using multi-turn question-answering (QA) data derived from human demonstration trajectories. This structured dataset enables hierarchical Chain-of-Thought (H-CoT) prompting, systematically extracting compositional knowledge to refine perception and decision-making, inspired by the human cognitive process of locating a target object through iterative reasoning steps. Additionally, we propose a Closed-Loop H-CoT mechanism that incorporates detection and reasoning confidence scores into training. This adaptive weighting strategy guides the model to prioritize high-confidence data pairs, mitigating the impact of noisy inputs and enhancing robustness against hallucinated or incorrect reasoning. Extensive experiments in the AI Habitat environment demonstrate CL-CoTNav's superior generalization to unseen scenes and novel object categories. Our method consistently outperforms state-of-the-art approaches in navigation success rate (SR) and success weighted by path length (SPL) by 22.4%. We release our datasets, models, and supplementary videos on our project page.
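The confidence-based weighting described above can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: each training pair is weighted by the product of its detection and reasoning confidence, and the batch loss is the normalized weighted mean of the per-sample losses. The function name `confidence_weighted_loss` and the `eps` parameter are hypothetical.

```python
def confidence_weighted_loss(losses, det_conf, reason_conf, eps=1e-8):
    """Weighted mean of per-sample losses.

    Assumed weighting (not specified in the source): the weight of each
    sample is det_conf * reason_conf, so a pair with low confidence in
    either module contributes little to the training signal.
    """
    weights = [d * r for d, r in zip(det_conf, reason_conf)]
    weighted_sum = sum(w * l for w, l in zip(weights, losses))
    # eps avoids division by zero when every weight is 0.
    return weighted_sum / (sum(weights) + eps)


# Hypothetical batch: the second pair has zero reasoning confidence,
# so its large loss is effectively ignored.
batch_loss = confidence_weighted_loss(
    losses=[1.0, 3.0],
    det_conf=[1.0, 0.5],
    reason_conf=[1.0, 0.0],
)
```

With equal confidences across the batch, this reduces to the ordinary mean loss, which is one reason a normalized weighting is a natural choice for such a sketch.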