🤖 AI Summary
Vision-and-language navigation (VLN) in map-free, dynamic environments remains challenging due to the need for real-time adaptability, robust grounding of language instructions, and safe physical execution.
Method: We propose a hierarchical architecture synergizing large models (for reflective, chain-of-thought reasoning with panoramic visual prompting) and small models (for reactive planning via causal learning–driven dual-branch perception). An uncertainty-aware mechanism dynamically fuses decisions from both modules. Additionally, we integrate a learnable point-goal policy with a real-time LiDAR-SLAM–based obstacle avoidance module.
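The uncertainty-aware fusion step can be illustrated with a minimal sketch. This is not the paper's actual UCM (its uncertainty estimate and fusion rule are not given here); it simply assumes each module outputs a categorical action distribution and weights each module by the inverse entropy of its prediction, so the more confident module dominates the fused decision:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a categorical distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def fuse_decisions(p_small, p_large, eps=1e-8):
    """Illustrative uncertainty-aware fusion of two action distributions.

    Hypothetical stand-in for the paper's UCM: each model's weight is the
    inverse of its predictive entropy, so a low-entropy (confident) model
    contributes more to the fused distribution.
    """
    h_s, h_l = entropy(p_small), entropy(p_large)
    w_s, w_l = 1.0 / (h_s + eps), 1.0 / (h_l + eps)
    z = w_s + w_l
    w_s, w_l = w_s / z, w_l / z
    fused = [w_s * a + w_l * b for a, b in zip(p_small, p_large)]
    total = sum(fused)
    return [x / total for x in fused]  # renormalize

# Example: a confident small model vs. an uncertain large model.
fused = fuse_decisions([0.9, 0.05, 0.05], [0.4, 0.3, 0.3])
```

Here the small model's sharp prediction pulls the fused decision toward its preferred action; many other uncertainty proxies (e.g., ensemble disagreement or token-level confidence) could replace entropy in the same scheme.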
Contributions/Results: This work introduces the first hierarchical large–small model collaboration framework for VLN and achieves end-to-end deployability from simulation to real-world robots. On the VLN-CE benchmark it ranks #1, with significant improvements in success rate (SR) and success weighted by path length (SPL) over the prior state of the art, especially in unseen environments. Real-robot experiments demonstrate strong robustness and cross-scenario generalization.
📝 Abstract
Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent large vision-language models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models on VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from the two models. For obstacle avoidance, we replace the rule-based controller with a fully learnable point-goal policy in simulation; for real-world deployment, we design a LiDAR-based clustering module that generates navigable waypoints and pair it with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results, ranking 1st on the VLN-CE leaderboard and significantly improving SR and SPL on the test-unseen split over previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.
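The LiDAR-based clustering step can be sketched in miniature. The paper's actual module and parameters are not specified here; this hypothetical version performs single-linkage Euclidean clustering of 2-D scan points (obstacle returns) via BFS, and returns cluster centroids, from which free-space gaps between obstacles could then be carved into navigable waypoints:

```python
from collections import deque

def euclidean_clusters(points, radius=0.5):
    """Single-linkage clustering of 2-D LiDAR points.

    Points closer than `radius` (meters) are merged into one obstacle
    cluster via breadth-first search. Illustrative only: a real module
    would use a spatial index (e.g., a k-d tree) instead of O(n^2) scans.
    """
    n = len(points)
    visited = [False] * n
    clusters = []
    for i in range(n):
        if visited[i]:
            continue
        queue, members = deque([i]), []
        visited[i] = True
        while queue:
            j = queue.popleft()
            members.append(points[j])
            xj, yj = points[j]
            for k in range(n):
                if not visited[k]:
                    xk, yk = points[k]
                    if (xj - xk) ** 2 + (yj - yk) ** 2 <= radius ** 2:
                        visited[k] = True
                        queue.append(k)
        clusters.append(members)
    return clusters

def cluster_centroids(clusters):
    """Centroid of each obstacle cluster, used to locate free-space gaps."""
    return [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]

# Example: two well-separated obstacle groups in a scan.
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
groups = euclidean_clusters(pts, radius=0.5)  # -> 2 clusters
```

In deployment, waypoint candidates would be placed in the gaps between neighboring cluster centroids and handed to the SLAM-based local controller for tracking.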