🤖 AI Summary
To address insufficient robustness in global path planning for outdoor long-range autonomous navigation, this paper proposes a navigation framework that integrates multimodal perception with on-road scene understanding. Methodologically, it incorporates vision-language models (VLMs) into the robotic navigation pipeline alongside LiDAR-based geometric modeling, image-based semantic segmentation, and GPS/QGIS map data, jointly modeling geometric, semantic, and social-norm context to support human-aligned trajectory generation and local path refinement. The key contribution is a VLM-driven social-context understanding mechanism that significantly improves traversability assessment in complex road environments. Experiments on real-world deployments and the GND dataset show a 10% improvement in traversability on navigable terrains while maintaining a navigation distance comparable to state-of-the-art methods, supporting the claim that multimodal fusion substantially improves robustness for outdoor long-range navigation.
📝 Abstract
We present MOSU, a novel autonomous long-range navigation system that enhances global navigation for mobile robots through multimodal perception and on-road scene understanding. MOSU addresses the outdoor robot navigation challenge by integrating geometric, semantic, and contextual information to ensure comprehensive scene understanding. The system combines GPS and QGIS map-based routing for high-level global path planning with multimodal trajectory generation for local navigation refinement. For trajectory generation, MOSU leverages multiple modalities: LiDAR-based geometric data for precise obstacle avoidance, image-based semantic segmentation for traversability assessment, and Vision-Language Models (VLMs) to capture social context and enable the robot to adhere to social norms in complex environments. This multimodal integration improves scene understanding and enhances traversability, allowing the robot to adapt to diverse outdoor conditions. We evaluate our system in real-world on-road environments and benchmark it on the GND dataset, achieving a 10% improvement in traversability on navigable terrains while maintaining a navigation distance comparable to that of existing global navigation methods.
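To make the local-refinement idea concrete, below is a minimal Python sketch of one way a multimodal trajectory scorer could fuse the three signals the abstract describes (LiDAR geometry, semantic traversability, and VLM-derived social context). All names here (`ModalityScores`, `fuse_scores`, `select_trajectory`) and the fixed weights are hypothetical illustrations, not MOSU's published implementation.

```python
# Hypothetical sketch of multimodal trajectory scoring, assuming each candidate
# local trajectory can be scored per modality on a normalized [0, 1] scale:
# LiDAR geometry (obstacle clearance), semantic segmentation (traversability),
# and a VLM-based social-context assessment. Fusion weights are illustrative.

from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) waypoint in the robot's local frame
Trajectory = List[Point]


@dataclass
class ModalityScores:
    geometric: float  # 1.0 = safe clearance per LiDAR geometric data
    semantic: float   # 1.0 = fully traversable per semantic segmentation
    social: float     # 1.0 = socially compliant per VLM assessment


def fuse_scores(s: ModalityScores,
                w_geo: float = 0.5,
                w_sem: float = 0.3,
                w_soc: float = 0.2) -> float:
    """Weighted fusion of per-modality scores into one trajectory score."""
    return w_geo * s.geometric + w_sem * s.semantic + w_soc * s.social


def select_trajectory(candidates: Sequence[Trajectory],
                      score_fn: Callable[[Trajectory], ModalityScores]
                      ) -> Trajectory:
    """Pick the candidate trajectory with the highest fused multimodal score."""
    return max(candidates, key=lambda traj: fuse_scores(score_fn(traj)))
```

In such a scheme, the GPS/QGIS route would supply the goal direction for generating `candidates`, while the fused score selects the local trajectory that stays obstacle-free, remains on traversable terrain, and respects social norms; how MOSU actually combines these signals is detailed in the paper itself.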