🤖 AI Summary
Existing vision-language navigation (VLN) approaches rely on pre-built maps, candidate viewpoints, or explicit topological graphs, which limits adaptability in unknown environments and hinders true zero-shot generalization. Method: We propose an end-to-end zero-shot VLN framework that leverages CLIP's monocular vision–language representations, integrating semantic-driven path generation with online decision control and eliminating dependence on candidate views, node graphs, and prior maps. The framework enables real-time active navigation directly from natural language instructions. Contribution/Results: Deployed on our custom UGV platform, Rover Master, it achieves simultaneous autonomous exploration and target localization in completely unseen environments. It demonstrates robust obstacle avoidance, efficient trajectory generation, and performance competitive with methods that depend on prior knowledge, while significantly outperforming blind-traversal baselines. Our core contribution is the first zero-shot, end-to-end VLN paradigm with integrated perception and decision-making.
📝 Abstract
Vision-language navigation (VLN) has emerged as a promising paradigm, enabling mobile robots to perform zero-shot inference and execute tasks without task-specific pre-programming. However, current systems often separate map exploration from path planning, with exploration relying on inefficient algorithms due to limited (partially observed) environmental information. In this paper, we present a novel navigation pipeline named "ClipRover" for simultaneous exploration and target discovery in unknown environments, leveraging the capabilities of the vision-language model CLIP. Our approach requires only monocular vision and operates without any prior map or knowledge of the target. For comprehensive evaluation, we design a functional prototype of an unmanned ground vehicle (UGV) named "Rover Master", a customized platform for general-purpose VLN tasks. We integrate and deploy the ClipRover pipeline on Rover Master to evaluate its throughput, obstacle avoidance capability, and trajectory performance across various real-world scenarios. Experimental results demonstrate that ClipRover consistently outperforms traditional map traversal algorithms and achieves performance comparable to path-planning methods that depend on prior knowledge of the map and target. Notably, ClipRover offers real-time active navigation without requiring pre-captured candidate images or pre-built node graphs, addressing key limitations of existing VLN pipelines.
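The core idea of CLIP-driven navigation can be illustrated with a toy sketch: embed the language instruction and several candidate view directions into a shared embedding space, then steer toward the heading with the highest semantic similarity. This is a minimal illustration under stated assumptions, not the ClipRover implementation; the vectors below are hypothetical stand-ins for the outputs of CLIP's text and image encoders.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_heading(instruction_emb, view_embs):
    # Choose the candidate heading whose view embedding best matches
    # the instruction embedding.
    scores = {h: cosine_sim(instruction_emb, e) for h, e in view_embs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical stand-ins for CLIP embeddings: in a real pipeline these
# would come from CLIP's text encoder (instruction) and image encoder
# (crops of the monocular frame corresponding to candidate headings).
instruction_emb = np.array([0.1, 0.9, 0.2])
view_embs = {
    "left":   np.array([0.8, 0.1, 0.1]),
    "center": np.array([0.2, 0.8, 0.3]),
    "right":  np.array([0.1, 0.2, 0.9]),
}

heading, scores = pick_heading(instruction_emb, view_embs)
print(heading)  # prints "center": the view most similar to the instruction
```

In an online loop, this scoring would run on every incoming frame, so the robot continuously re-ranks directions as the scene changes, which is what enables active navigation without pre-captured candidate images or a node graph.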