MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient robustness in global path planning for outdoor long-range autonomous navigation, this paper proposes a navigation framework that integrates multimodal perception with on-road scene understanding. Methodologically, it brings vision-language models (VLMs) into the navigation pipeline alongside LiDAR-based geometric modeling, image semantic segmentation, and GPS/QGIS map data, jointly modeling geometric, semantic, and social context to generate socially compliant trajectories and refine them locally. The key contribution is a VLM-driven social-context understanding mechanism that improves traversability assessment in complex road environments. Experiments on real-world deployments and the GND dataset demonstrate a 10% improvement in traversability on navigable terrains while maintaining a navigation distance comparable to existing global navigation methods, indicating that multimodal fusion enhances robustness for outdoor long-range navigation.
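
To make the fusion idea concrete, below is a minimal Python sketch of weighted per-modality cost fusion for trajectory selection. The names (ModalityCosts, fused_cost, select_trajectory) and the weights are illustrative assumptions, not the paper's published formulation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ModalityCosts:
    geometric: float  # LiDAR-based obstacle proximity (lower = safer)
    semantic: float   # segmentation-based traversability (lower = more traversable)
    social: float     # VLM-derived social-context penalty (lower = more acceptable)

def fused_cost(c: ModalityCosts,
               w_geo: float = 1.0, w_sem: float = 1.0, w_soc: float = 0.5) -> float:
    """Weighted sum of the three per-modality costs; lower is better."""
    return w_geo * c.geometric + w_sem * c.semantic + w_soc * c.social

def select_trajectory(candidates: List[Tuple[object, ModalityCosts]]):
    """Return the candidate trajectory whose fused cost is lowest."""
    best, _ = min(candidates, key=lambda pair: fused_cost(pair[1]))
    return best
```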

📝 Abstract
We present MOSU, a novel autonomous long-range navigation system that enhances global navigation for mobile robots through multimodal perception and on-road scene understanding. MOSU addresses the outdoor robot navigation challenge by integrating geometric, semantic, and contextual information to ensure comprehensive scene understanding. The system combines GPS and QGIS map-based routing for high-level global path planning and multi-modal trajectory generation for local navigation refinement. For trajectory generation, MOSU leverages multiple modalities: LiDAR-based geometric data for precise obstacle avoidance, image-based semantic segmentation for traversability assessment, and Vision-Language Models (VLMs) to capture social context and enable the robot to adhere to social norms in complex environments. This multi-modal integration improves scene understanding and enhances traversability, allowing the robot to adapt to diverse outdoor conditions. We evaluate our system in real-world on-road environments and benchmark it on the GND dataset, achieving a 10% improvement in traversability on navigable terrains while maintaining a comparable navigation distance to existing global navigation methods.
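
The two-level design in the abstract (a global GPS/QGIS route, then local multi-modal refinement) can be sketched as a simple loop. Everything below is a hypothetical skeleton under stated assumptions: Waypoint, sample_local_trajectories, and the distance-only stub_cost standing in for the real multi-modal fusion are invented for illustration, not MOSU's published interface.

```python
import math
import random
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in a local metric frame

def sample_local_trajectories(goal: Waypoint, n: int = 16) -> List[Waypoint]:
    """Sample short-horizon candidate endpoints around the current goal."""
    gx, gy = goal
    return [(gx + random.uniform(-2.0, 2.0), gy + random.uniform(-2.0, 2.0))
            for _ in range(n)]

def stub_cost(candidate: Waypoint, goal: Waypoint) -> float:
    """Stand-in for the geometric + semantic + social fusion:
    here only distance-to-goal, purely for illustration."""
    return math.dist(candidate, goal)

def follow_route(route: List[Waypoint]) -> List[Waypoint]:
    """Walk the global route waypoint by waypoint, refining each step locally.
    In the paper's setting, `route` would come from the GPS/QGIS planner."""
    executed = []
    for waypoint in route:
        best = min(sample_local_trajectories(waypoint),
                   key=lambda c: stub_cost(c, waypoint))
        executed.append(best)
    return executed
```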
Problem

Research questions and friction points this paper is trying to address.

Global path planning for long-range outdoor navigation lacks robust scene understanding
Geometric, semantic, and social context are rarely modeled jointly in one navigation stack
Traversability assessment and social-norm adherence remain unreliable in complex outdoor environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines GPS and QGIS map-based routing for high-level global path planning
Uses LiDAR geometry for obstacle avoidance and image semantic segmentation for traversability assessment
Leverages Vision-Language Models for social-context understanding (a hypothetical sketch follows below)
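
As a rough illustration of how a VLM rating could feed the planner, the sketch below converts a scene-acceptability score into a cost. The prompt wording and the vlm.ask interface are assumptions made for illustration; the paper describes VLM-based social-context understanding but does not specify this API.

```python
# Hypothetical sketch: a VLM rates how socially acceptable it is to drive
# through a region; the rating is inverted into a planner cost.
SOCIAL_PROMPT = (
    "You see an on-road scene from a mobile robot's camera. Rate from 0 "
    "(socially forbidden) to 1 (perfectly fine) how acceptable it is for "
    "the robot to drive through the highlighted region."
)

def social_cost(vlm, image, region_mask) -> float:
    """Turn a VLM acceptability rating into a planner cost (lower = better).
    `vlm.ask` is an invented interface, not a real library call."""
    rating = float(vlm.ask(image=image, mask=region_mask, prompt=SOCIAL_PROMPT))
    rating = max(0.0, min(1.0, rating))  # clamp to [0, 1]
    return 1.0 - rating                  # invert: acceptable regions cost less
```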