Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation

📅 2025-05-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In data-efficient settings, Vision-Language Navigation (VLN) suffers from poorly understood reasoning strategies and frequent reasoning collapse at inference time. Method: We propose Aux-Think, a framework that internalizes structured Chain-of-Thought (CoT) reasoning through supervised multimodal training and decouples reasoning from action at inference time, outputting navigation decisions directly. Contributions/Results: (1) We systematically identify and characterize the inference-time reasoning collapse phenomenon in VLN for the first time; (2) we introduce a training-inference decoupling paradigm; and (3) we construct R2R-CoT-320k, the first large-scale CoT-annotated dataset for VLN. Experiments show that, under the same data budget, Aux-Think achieves higher navigation accuracy at lower training cost than baselines, empirically supporting internalized reasoning over online reasoning.
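
The following is a minimal sketch of the training-inference decoupling described in the summary: CoT supervision acts as an auxiliary training loss, while inference decodes actions directly without generating reasoning tokens. The module names, feature sizes, the single-token CoT target, and the `aux_weight` hyperparameter are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxThinkPolicy(nn.Module):
    """Toy policy with an action head and an auxiliary CoT head (hypothetical)."""
    def __init__(self, feat_dim=512, num_actions=6, vocab_size=32000):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, feat_dim)          # stand-in for a VLM backbone
        self.action_head = nn.Linear(feat_dim, num_actions)   # predicts the next navigation action
        self.cot_head = nn.Linear(feat_dim, vocab_size)       # auxiliary CoT token prediction (training only)

    def forward(self, obs_feat):
        h = torch.relu(self.encoder(obs_feat))
        return self.action_head(h), self.cot_head(h)

def training_step(model, obs_feat, action_gt, cot_token_gt, aux_weight=0.5):
    """CoT supervision is an auxiliary loss; the action loss stays primary.
    Real CoT targets are token sequences; a single token is used here for brevity."""
    action_logits, cot_logits = model(obs_feat)
    loss_action = F.cross_entropy(action_logits, action_gt)
    loss_cot = F.cross_entropy(cot_logits, cot_token_gt)      # internalizes reasoning patterns
    return loss_action + aux_weight * loss_cot

@torch.no_grad()
def act(model, obs_feat):
    """Inference: no reasoning is generated, the action is decoded directly."""
    action_logits, _ = model(obs_feat)
    return action_logits.argmax(dim=-1)

if __name__ == "__main__":
    model = AuxThinkPolicy()
    obs = torch.randn(4, 512)   # dummy batch of fused vision-language features
    loss = training_step(model, obs, torch.randint(0, 6, (4,)), torch.randint(0, 32000, (4,)))
    print(loss.item(), act(model, obs))
```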

📝 Abstract
Vision-Language Navigation (VLN) is a critical task for developing embodied agents that can follow natural language instructions to navigate in complex real-world environments. Recent advances in VLN by large pretrained models have significantly improved generalization and instruction grounding compared to traditional approaches. However, the role of reasoning strategies in navigation, an action-centric, long-horizon task, remains underexplored, despite the demonstrated success of Chain-of-Thought (CoT) reasoning in static tasks such as visual question answering. To address this gap, we conduct the first systematic evaluation of reasoning strategies for VLN, including No-Think (direct action prediction), Pre-Think (reason before action), and Post-Think (reason after action). Surprisingly, our findings reveal the Inference-time Reasoning Collapse issue, where inference-time reasoning degrades navigation accuracy, highlighting the challenges of integrating reasoning into VLN. Based on this insight, we propose Aux-Think, a framework that trains models to internalize structured reasoning patterns through CoT supervision, while inferring actions directly, without reasoning, during online prediction. To support this framework, we release R2R-CoT-320k, the first Chain-of-Thought annotated dataset for VLN. Extensive experiments show that Aux-Think greatly reduces training effort and achieves the best performance under the same data scale.
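
To make the three evaluated strategies concrete, here is a small sketch of how their supervision targets differ. The exact prompt templates are assumptions; only the ordering of the reasoning and action segments reflects the definitions in the abstract.

```python
def build_target(strategy: str, action: str, reasoning: str) -> str:
    """Illustrative target formats for the No-Think / Pre-Think / Post-Think strategies."""
    if strategy == "no_think":    # direct action prediction
        return f"Action: {action}"
    if strategy == "pre_think":   # reason first, then act
        return f"Reasoning: {reasoning}\nAction: {action}"
    if strategy == "post_think":  # act first, then justify
        return f"Action: {action}\nReasoning: {reasoning}"
    raise ValueError(f"unknown strategy: {strategy}")

print(build_target("pre_think", "turn left", "the hallway ahead matches the instruction"))
```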
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning strategies for Vision-Language Navigation (VLN)
Addressing Inference-time Reasoning Collapse in VLN tasks
Proposing Aux-Think to internalize reasoning during training while predicting actions directly at inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluates reasoning strategies (No-Think, Pre-Think, Post-Think) for Vision-Language Navigation
Proposes the Aux-Think framework, which internalizes reasoning through CoT supervision during training
Introduces R2R-CoT-320k, the first CoT-annotated dataset for VLN
🔎 Similar Papers
No similar papers found.
Shuo Wang
Renmin University of China
Yongcai Wang
Renmin University of China
Wanting Li
Renmin University of China
Xudong Cai
Renmin University of China
computer vision, camera localization, SLAM
Yucheng Wang
ETH Zürich
Multimodal LLM, Speech Understanding, Human-Computer Interaction
Maiyue Chen
Horizon Robotics
Kaihui Wang
Horizon Robotics
Zhizhong Su
Horizon Robotics
Deep Learning, Computer Vision, Autonomous Driving, Robotics Learning
Deying Li
Renmin University of China
Zhaoxin Fan
Beijing Academy of Blockchain and Edge Computing