🤖 AI Summary
This work addresses the challenges of cognitive overload in single-agent vision-and-language navigation and high coordination costs in multi-agent systems, particularly instruction drift and reasoning degradation in long-horizon tasks. To this end, the authors propose DACo, a dual-agent framework that decouples global planning from local execution: a Global Commander generates high-level strategies and dynamic sub-goals, while a Local Operative focuses on egocentric perception and fine-grained action control, augmented by an adaptive replanning mechanism. The architecture integrates both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL) multimodal large language models. Evaluated in zero-shot settings on the R2R, REVERIE, and R4R benchmarks, DACo outperforms the strongest baselines by 4.9%, 6.5%, and 5.4%, respectively, significantly enhancing stability, generalization, and robustness in long-horizon vision-and-language navigation.
📝 Abstract
Vision-and-language navigation is a fundamental capability for embodied human-AI collaboration, requiring agents to follow natural language instructions to execute coherent action sequences in complex environments. Existing approaches either rely on multiple agents, incurring high coordination and resource costs, or adopt a single-agent paradigm, which overloads the agent with both global planning and local perception, often leading to degraded reasoning and instruction drift in long-horizon settings. To address these issues, we introduce DACo, a planning-grounding decoupled architecture that disentangles global deliberation from local grounding. Concretely, it employs a Global Commander for high-level strategic planning and a Local Operative for egocentric observation and fine-grained execution. By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability. The framework further integrates dynamic subgoal planning and adaptive replanning to enable structured and resilient navigation. Extensive evaluations on R2R, REVERIE, and R4R demonstrate that DACo achieves 4.9%, 6.5%, and 5.4% absolute improvements over the best-performing baselines in zero-shot settings, and generalizes effectively across both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL series) backbones. DACo provides a principled and extensible paradigm for robust long-horizon navigation. Project page: https://github.com/ChocoWu/DACo
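The planning-grounding decoupling described above can be sketched as a simple control loop. The class and method names below are illustrative placeholders, not the paper's actual API; in DACo the Commander and Operative would be backed by MLLM calls (e.g., GPT-4o or Qwen-VL), and replanning would be triggered by execution feedback.

```python
# Hypothetical sketch of a dual-agent navigation loop in the spirit of DACo.
class Commander:
    """Global deliberation: full instruction -> ordered sub-goals."""
    def plan(self, instruction):
        # Stand-in for an MLLM planning call: split clause-level sub-goals.
        return [c.strip() for c in instruction.split(",") if c.strip()]

    def replan(self, instruction, remaining_subgoals):
        # Adaptive replanning hook: regenerate the rest of the plan when
        # execution stalls. Here it simply keeps the remaining sub-goals.
        return list(remaining_subgoals)

class Operative:
    """Local grounding: one sub-goal + egocentric view -> low-level action."""
    def act(self, subgoal, observation):
        return f"move_toward({subgoal})"

def navigate(instruction, observe, commander, operative, max_steps=20):
    plan = commander.plan(instruction)
    actions = []
    while plan and len(actions) < max_steps:
        subgoal = plan[0]                       # current sub-goal only:
        actions.append(operative.act(subgoal,   # the Operative never sees
                                     observe()))  # the whole instruction
        plan.pop(0)  # assume each sub-goal resolves in one step here
    return actions

actions = navigate(
    "exit the bedroom, turn left, stop at the sofa",
    observe=lambda: "egocentric RGB frame",
    commander=Commander(),
    operative=Operative(),
)
print(actions)
# → ['move_toward(exit the bedroom)', 'move_toward(turn left)',
#    'move_toward(stop at the sofa)']
```

The point of the structure is that the Operative only ever conditions on the current sub-goal and the current observation, so long instructions never accumulate in its context, which is the intuition behind the reduced cognitive overload and instruction drift.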