PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses long-range navigation for ground and near-ground robots, focusing on how to exploit the global prior information in bird's-eye-view (BEV) maps. The authors propose a framework that combines natural language instructions with an image generation model: the model interprets human commands, identifies the target location, and generates traversable area masks, while cross-view localization aligns the robot's odometry with the BEV map to mitigate long-term drift and guide a conventional local motion planner. The approach transfers the world knowledge and generalization ability of image generation models to embodied navigation. Alongside benchmark experiments, the system is validated on a UAV platform, which completes a 160-meter outdoor navigation task using only a conventional local motion planner.
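To make the role of a traversable area mask concrete, the following is a minimal sketch of how a binary mask over a BEV image could be turned into a coarse global route. It is not the authors' pipeline: the breadth-first search, the 4-connected grid, and the names `plan_on_mask` and `pixels_to_metres` are all assumptions chosen for illustration, and the summary does not say how the mask is actually consumed by the planner.

```python
from collections import deque


def plan_on_mask(mask, start, goal):
    """Shortest 4-connected path over traversable cells using BFS.

    mask  : list of rows; truthy cells are traversable (hypothetical format)
    start : (row, col) of the robot in BEV pixel coordinates
    goal  : (row, col) of the target in BEV pixel coordinates
    Returns a list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(mask), len(mask[0])
    if not (mask[start[0]][start[1]] and mask[goal[0]][goal[1]]):
        return None  # start or goal lies outside the traversable area

    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and mask[nr][nc] and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal is not reachable through traversable cells


def pixels_to_metres(path, metres_per_pixel):
    """Scale pixel waypoints to metres (assumes a uniform map scale)."""
    return [(r * metres_per_pixel, c * metres_per_pixel) for r, c in path]


if __name__ == "__main__":
    demo_mask = [
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 1],
    ]
    route = plan_on_mask(demo_mask, (0, 0), (2, 0))
    print(pixels_to_metres(route, 0.5))
```

In a setup like the one described, waypoints of this kind would only serve as coarse global guidance; obstacle avoidance along the way is left to the local motion planner.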

📝 Abstract

Bird's-eye-view (BEV) images have been widely demonstrated to provide valuable prior information for navigation. Given the global information provided by such views, two key challenges remain: how to fully exploit this information and how to reliably use it during execution. In this paper, we propose a navigation system that uses BEV images as global priors and is designed for ground and near-ground robotic platforms. The system employs an image generation model to interpret human intent from natural language, identify the target destination, and generate traversability masks. During execution, we introduce cross-view localization to align the robot's odometry with the BEV map and mitigate long-term drift in conventional odometry. We conduct extensive benchmark experiments to evaluate the proposed method and further validate it on a UAV platform. Using only a conventional local motion planner, the UAV successfully completes a 160-meter outdoor long-range navigation task. This work demonstrates how the world-understanding capabilities of foundation models can be transferred to embodied navigation, enabling robots to benefit from the strong generalization ability of existing image generation models.
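The idea of aligning odometry with the BEV map can be illustrated with a small worked example. The sketch below assumes that cross-view localization yields a few matched position pairs (a position in the odometry frame and the same position in the map frame) and fits a 2D rigid transform to them by least squares. The abstract does not describe the actual alignment procedure, so this formulation and the names `fit_rigid_2d` and `apply_rigid_2d` are hypothetical.

```python
import math


def fit_rigid_2d(odom_pts, map_pts):
    """Least-squares 2D rotation and translation mapping odom_pts to map_pts.

    Both arguments are equal-length lists of (x, y) pairs, assumed to be
    correspondences produced by some cross-view localization step.
    Returns (theta, tx, ty) such that map ~= R(theta) * odom + (tx, ty).
    """
    n = len(odom_pts)
    ax = sum(p[0] for p in odom_pts) / n
    ay = sum(p[1] for p in odom_pts) / n
    bx = sum(p[0] for p in map_pts) / n
    by = sum(p[1] for p in map_pts) / n

    # Accumulate dot and cross terms of the centred correspondences.
    s_dot = 0.0
    s_cross = 0.0
    for (ox, oy), (mx, my) in zip(odom_pts, map_pts):
        ux, uy = ox - ax, oy - ay
        vx, vy = mx - bx, my - by
        s_dot += ux * vx + uy * vy
        s_cross += ux * vy - uy * vx

    theta = math.atan2(s_cross, s_dot)  # optimal rotation angle
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Translation places the rotated odometry centroid on the map centroid.
    tx = bx - (cos_t * ax - sin_t * ay)
    ty = by - (sin_t * ax + cos_t * ay)
    return theta, tx, ty


def apply_rigid_2d(transform, point):
    """Map a point from the odometry frame into the BEV map frame."""
    theta, tx, ty = transform
    x, y = point
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (cos_t * x - sin_t * y + tx, sin_t * x + cos_t * y + ty)


if __name__ == "__main__":
    odom = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    bev = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]  # odom rotated 90 deg, shifted
    transform = fit_rigid_2d(odom, bev)
    print(apply_rigid_2d(transform, (1.0, 1.0)))
```

Re-estimating such a transform as new matches arrive is one simple way to keep drifting odometry anchored to the map; a real system would also need outlier rejection and some weighting of recent matches.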

Problem

Research questions and friction points this paper is trying to address.

embodied navigation

bird's-eye-view

global prior

traversability

long-range navigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied navigation

bird's-eye-view (BEV)

foundation models