🤖 AI Summary
To address the spatial alignment difficulty in embodied navigation caused by the ambiguity and verbosity of natural language instructions, this paper proposes Visual Prompt Navigation (VPN), a novel paradigm in which users directly annotate visual trajectories on a 2D top-down map as navigation guidance, thereby bypassing linguistic ambiguity. The authors introduce the first benchmark datasets tailored for visual prompting, R2R-VP and R2R-CE-VP, along with a joint view-level and trajectory-level data augmentation strategy. They further propose VPNet, a unified architecture capable of interpreting visual prompts and predicting paths in both discrete and continuous navigation settings. Extensive experiments examine the effect of diverse prompt formats, map representations, and augmentation techniques, yielding significant performance gains across multiple metrics. The code and datasets are publicly released.
📝 Abstract
While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To address this, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. The visual prompt primarily marks the navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is friendlier to non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network for the VPN tasks, together with two data augmentation strategies that enhance navigation performance: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes). Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.
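To make the view-level augmentation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the visual prompt is stored as a list of 2D waypoints on the top-down map, that the heading is measured counter-clockwise in degrees, and that the prompt and the initial heading are rotated by the same angle so they stay consistent. The function names `rotate_prompt` and `augment_view` are hypothetical; the released code may alter headings and orientations differently.

```python
import math


def rotate_prompt(points, angle_deg, center=(0.0, 0.0)):
    """Rotate trajectory waypoints (x, y) counter-clockwise about `center`.

    Assumption: the prompt is a polyline of map coordinates, not an image.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + cos_t * dx - sin_t * dy,
                        cy + sin_t * dx + cos_t * dy))
    return rotated


def augment_view(points, heading_deg, angle_deg, center=(0.0, 0.0)):
    """Return a rotated copy of the prompt and the matching initial heading.

    Assumption: heading and prompt share one rotation angle, so the agent's
    starting direction still agrees with the drawn trajectory.
    """
    new_points = rotate_prompt(points, angle_deg, center)
    new_heading = (heading_deg + angle_deg) % 360.0
    return new_points, new_heading


if __name__ == "__main__":
    prompt = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
    for angle in (0.0, 90.0, 180.0, 270.0):
        pts, heading = augment_view(prompt, heading_deg=0.0, angle_deg=angle)
        print(angle, heading, [(round(x, 3), round(y, 3)) for x, y in pts])
```

Sampling several rotation angles per episode, as in the loop above, yields multiple training views of the same trajectory without collecting new paths; trajectory-level augmentation would instead add new paths from additional scenes.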