🤖 AI Summary
Vision-and-language navigation (VLN) in map-free, dynamic environments remains challenging due to the need for real-time adaptability, robust grounding of language instructions, and safe physical execution.
Method: We propose a hierarchical architecture synergizing large models (for reflective, chain-of-thought reasoning with panoramic visual prompting) and small models (for reactive planning via causal learning–driven dual-branch perception). An uncertainty-aware mechanism dynamically fuses decisions from both modules. Additionally, we integrate a learnable point-goal policy with a real-time LiDAR-SLAM–based obstacle avoidance module.
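The uncertainty-aware fusion step can be illustrated with a minimal sketch. This is not the paper's actual UCM (its uncertainty estimate and fusion rule are not given here); it simply assumes each module outputs a categorical action distribution and weights each module by the inverse entropy of its prediction, so the more confident module dominates the fused decision:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a categorical distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def fuse_decisions(p_small, p_large, eps=1e-8):
    """Illustrative uncertainty-aware fusion of two action distributions.

    Hypothetical stand-in for the paper's UCM: each model's weight is the
    inverse of its predictive entropy, so a low-entropy (confident) model
    contributes more to the fused distribution.
    """
    h_s, h_l = entropy(p_small), entropy(p_large)
    w_s, w_l = 1.0 / (h_s + eps), 1.0 / (h_l + eps)
    z = w_s + w_l
    w_s, w_l = w_s / z, w_l / z
    fused = [w_s * a + w_l * b for a, b in zip(p_small, p_large)]
    total = sum(fused)
    return [x / total for x in fused]  # renormalize

# Example: a confident small model vs. an uncertain large model.
fused = fuse_decisions([0.9, 0.05, 0.05], [0.4, 0.3, 0.3])
```

Here the small model's sharp prediction pulls the fused decision toward its preferred action; many other uncertainty proxies (e.g., ensemble disagreement or token-level confidence) could replace entropy in the same scheme.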
Contributions/Results: This work introduces the first hierarchical large–small model collaboration framework for VLN and achieves end-to-end deployability from simulation to real-world robots. On the VLN-CE benchmark it ranks #1, with significant improvements in success rate (SR) and success weighted by path length (SPL) over the prior state of the art, especially in unseen environments. Real-robot experiments demonstrate strong robustness and cross-scenario generalization.
📝 Abstract
Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent large vision-language models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models on VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from the two models. For obstacle avoidance, we replace the rule-based controller with a fully learnable point-goal policy in simulation; for real-world deployment, we design a LiDAR-based clustering module that generates navigable waypoints and pair it with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results, ranking 1st on the VLN-CE leaderboard and significantly improving SR and SPL on the test-unseen split over previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.
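The LiDAR-based clustering step can be sketched in miniature. The paper's actual module and parameters are not specified here; this hypothetical version performs single-linkage Euclidean clustering of 2-D scan points (obstacle returns) via BFS, and returns cluster centroids, from which free-space gaps between obstacles could then be carved into navigable waypoints:

```python
from collections import deque

def euclidean_clusters(points, radius=0.5):
    """Single-linkage clustering of 2-D LiDAR points.

    Points closer than `radius` (meters) are merged into one obstacle
    cluster via breadth-first search. Illustrative only: a real module
    would use a spatial index (e.g., a k-d tree) instead of O(n^2) scans.
    """
    n = len(points)
    visited = [False] * n
    clusters = []
    for i in range(n):
        if visited[i]:
            continue
        queue, members = deque([i]), []
        visited[i] = True
        while queue:
            j = queue.popleft()
            members.append(points[j])
            xj, yj = points[j]
            for k in range(n):
                if not visited[k]:
                    xk, yk = points[k]
                    if (xj - xk) ** 2 + (yj - yk) ** 2 <= radius ** 2:
                        visited[k] = True
                        queue.append(k)
        clusters.append(members)
    return clusters

def cluster_centroids(clusters):
    """Centroid of each obstacle cluster, used to locate free-space gaps."""
    return [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]

# Example: two well-separated obstacle groups in a scan.
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
groups = euclidean_clusters(pts, radius=0.5)  # -> 2 clusters
```

In deployment, waypoint candidates would be placed in the gaps between neighboring cluster centroids and handed to the SLAM-based local controller for tracking.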