🤖 AI Summary
This work proposes a zero-shot unified framework for off-road autonomous navigation that eliminates the need for multiple task-specific models, extensive labeled data, and laborious tuning. By leveraging SAM2 for environment segmentation and a vision-language model (VLM) for multimodal reasoning over raw images and numerically annotated segmentation maps, the system directly infers traversable regions without any terrain-specific training. The approach integrates visual prompting and multimodal large language models into off-road scene understanding, enabling end-to-end navigation in a zero-shot manner. Experimental results show that the method outperforms existing trainable models on high-resolution segmentation benchmarks and supports deployment of a complete navigation stack in the Isaac Sim off-road simulation environment.
📝 Abstract
Traditional approaches to off-road autonomy rely on separate models for terrain classification, height estimation, and quantifying slip or slope conditions. Using several models requires training each component separately, curating task-specific datasets, and fine-tuning each model. In this work, we present a zero-shot approach that leverages SAM2 for environment segmentation and a vision-language model (VLM) to reason about drivable areas. Our approach passes to the VLM both the original image and the segmented image annotated with a numeric label for each mask. The VLM is then prompted to identify which regions, referenced by these numeric labels, are drivable. Combined with planning and control modules, this unified framework eliminates the need for explicit terrain-specific models, relying instead on the inherent reasoning capabilities of the VLM. Our approach surpasses state-of-the-art trainable models on high-resolution segmentation datasets and enables full-stack navigation in our Isaac Sim off-road environment.
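The pipeline described in the abstract can be sketched in a few lines: place a numeric label at each SAM2 mask's centroid (for the annotated image shown to the VLM), then union the masks whose labels the VLM returns as drivable. This is an illustrative sketch only; the function names (`label_positions`, `drivable_region`) and the mask representation (a list of boolean arrays) are assumptions, not the paper's actual code.

```python
import numpy as np

def label_positions(masks):
    """Map each mask's numeric label (1-based, as shown to the VLM)
    to the pixel centroid where that label would be drawn."""
    positions = {}
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        positions[idx] = (int(ys.mean()), int(xs.mean()))
    return positions

def drivable_region(masks, drivable_ids):
    """Union the masks whose numeric labels the VLM judged drivable."""
    combined = np.zeros_like(masks[0], dtype=bool)
    for idx, mask in enumerate(masks, start=1):
        if idx in drivable_ids:
            combined |= mask
    return combined
```

In use, `masks` would come from SAM2's automatic mask generator and `drivable_ids` would be parsed from the VLM's response to a prompt such as "which numbered regions are drivable?"; the resulting boolean region can then feed the planning and control modules.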