CL-CoTNav: Closed-Loop Hierarchical Chain-of-Thought for Zero-Shot Object-Goal Navigation with Vision-Language Models

📅 2025-04-11

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Weak zero-shot generalization of Visual Object Navigation (ObjectNav) to unseen environments and novel object categories stems primarily from the lack of structured reasoning in end-to-end approaches. To address this, we propose a Vision-Language Model (VLM)-driven closed-loop Hierarchical Chain-of-Thought (CoT) framework. It enables dynamic decision-making via adaptive confidence-weighted integration of detection and reasoning modules; introduces a multi-turn question-answering dataset of human demonstrations to support cognition-inspired perception-reasoning co-optimization; and combines hierarchical CoT prompting, VLM fine-tuning, and AI Habitat-based simulation training. Experiments demonstrate substantial improvements over state-of-the-art methods on zero-shot ObjectNav: Success Rate (SR) and Success-weighted by Path Length (SPL) increase by 22.4%. We publicly release our dataset, models, and demonstration videos.

Technology Category

Application Category

📝 Abstract

Visual Object Goal Navigation (ObjectNav) requires a robot to locate a target object in an unseen environment using egocentric observations. However, decision-making policies often struggle to transfer to unseen environments and novel target objects, which is the core generalization problem. Traditional end-to-end learning methods exacerbate this issue, as they rely on memorizing spatial patterns rather than employing structured reasoning, limiting their ability to generalize effectively. In this letter, we introduce Closed-Loop Hierarchical Chain-of-Thought Navigation (CL-CoTNav), a vision-language model (VLM)-driven ObjectNav framework that integrates structured reasoning and closed-loop feedback into navigation decision-making. To enhance generalization, we fine-tune a VLM using multi-turn question-answering (QA) data derived from human demonstration trajectories. This structured dataset enables hierarchical Chain-of-Thought (H-CoT) prompting, systematically extracting compositional knowledge to refine perception and decision-making, inspired by the human cognitive process of locating a target object through iterative reasoning steps. Additionally, we propose a Closed-Loop H-CoT mechanism that incorporates detection and reasoning confidence scores into training. This adaptive weighting strategy guides the model to prioritize high-confidence data pairs, mitigating the impact of noisy inputs and enhancing robustness against hallucinated or incorrect reasoning. Extensive experiments in the AI Habitat environment demonstrate CL-CoTNav's superior generalization to unseen scenes and novel object categories. Our method consistently outperforms state-of-the-art approaches in navigation success rate (SR) and success weighted by path length (SPL) by 22.4%. We release our datasets, models, and supplementary videos on our project page.

Problem

Research questions and friction points this paper is trying to address.

Improves generalization in unseen environments for ObjectNav

Enhances decision-making with structured reasoning and feedback

Mitigates noisy inputs via confidence-based adaptive weighting

Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop hierarchical Chain-of-Thought navigation framework

Fine-tuned VLM with multi-turn QA data

Adaptive weighting using confidence scores

🔎 Similar Papers

Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs