🤖 AI Summary
This work addresses the challenge of translating natural-language instructions into safe, long-horizon, and socially compliant robot navigation in open outdoor environments. It proposes Walk with Me, a hierarchical framework that operates without high-definition maps. A high-level vision-language model (VLM) uses GPS context and candidate points of interest from a public map API to ground semantic intent into a concrete destination and plan a coarse waypoint sequence. A low-level vision-language-action (VLA) policy executes routine navigation segments, while an observation-aware routing mechanism hands complex scenes, such as crowded crossings, back to the high-level VLM for explicit safety reasoning, which triggers stop-and-wait behavior when the situation is unsafe. The summary presents this as the first approach to long-range outdoor social navigation that does not rely on pre-built high-definition maps. By combining semantic understanding, safety-aware reasoning, and hierarchical control in real-world settings, the method aims to make robots more robust when executing natural-language commands.
📝 Abstract
Assisting humans in open-world outdoor environments requires robots to translate high-level natural-language intentions into safe, long-horizon, and socially compliant navigation behavior. Existing map-based methods rely on costly pre-built HD maps, while learning-based policies are mostly limited to indoor and short-horizon settings. To bridge this gap, we propose Walk with Me, a map-free framework for long-horizon social navigation from high-level human instructions. Walk with Me leverages GPS context and lightweight candidate points-of-interest from a public map API for semantic destination grounding and waypoint proposal. A High-Level Vision-Language Model grounds abstract instructions into concrete destinations and plans coarse waypoint sequences. During execution, an observation-aware routing mechanism determines whether the Low-Level Vision-Language-Action policy can handle the current situation or whether explicit safety reasoning from the High-Level VLM is needed. Routine segments are executed by the Low-Level VLA, while complex situations such as crowded crossings trigger high-level reasoning and stop-and-wait behavior when unsafe. By combining semantic intent grounding, map-free long-horizon planning, safety-aware reasoning, and low-level action generation, Walk with Me enables practical outdoor social navigation for human-centric assistance.
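The observation-aware routing described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Observation` fields, the `crowd_threshold` value, and the rule-based stand-ins for the routing check and the High-Level VLM's safety reasoning are hypothetical, since the abstract does not specify how either component is implemented.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    LOW_LEVEL = "low_level"  # routine segment: the Low-Level VLA acts directly
    PROCEED = "proceed"      # complex scene that high-level reasoning judged safe
    WAIT = "wait"            # complex scene judged unsafe: stop and wait


@dataclass
class Observation:
    # Hypothetical scene summary; the real system works from camera observations.
    pedestrians_nearby: int
    at_crossing: bool
    crossing_clear: bool = True


def needs_high_level(obs: Observation, crowd_threshold: int = 3) -> bool:
    """Stand-in for the routing check: escalate crossings and crowded scenes."""
    return obs.at_crossing or obs.pedestrians_nearby >= crowd_threshold


def high_level_safety_check(obs: Observation) -> bool:
    """Stand-in for the High-Level VLM's explicit safety reasoning."""
    return not (obs.at_crossing and not obs.crossing_clear)


def route(obs: Observation) -> Decision:
    """Pick who handles the current step, and whether to stop and wait."""
    if not needs_high_level(obs):
        return Decision.LOW_LEVEL
    return Decision.PROCEED if high_level_safety_check(obs) else Decision.WAIT


if __name__ == "__main__":
    print(route(Observation(pedestrians_nearby=0, at_crossing=False)))
    print(route(Observation(pedestrians_nearby=5, at_crossing=True,
                            crossing_clear=False)))
```

In this sketch the cheap routing check runs on every step and the more expensive safety reasoning runs only for escalated scenes, which mirrors the division of labor the abstract describes between the Low-Level VLA and the High-Level VLM.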