POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses precise "final-meters" navigation to points of interest (POIs) in vision-language navigation. It proposes a Brain-Action framework in which a Brain module reasons semantically about POIs to guide an Action module that predicts continuous waypoints. To support realistic evaluation, the authors introduce POINav-Bench, which they describe as the first benchmark for closed-loop evaluation of real-world POI-goal navigation. It is built on 3D Gaussian Splatting reconstructions of 11 commercial areas covering 126,398 square meters in total and 163 POIs, with traversability-aware annotations and reference trajectories. The study also curates POINav-Dataset, containing 70K real-world signage-entrance pairs. According to the authors, the experiments show that the framework offers a viable path toward refining real-world POI-goal navigation.

📝 Abstract

Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 $m^{2}$ in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich environments. Building on this, we propose the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.
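The Brain-Action split and the closed-loop evaluation described in the abstract can be pictured with a toy 2D sketch. Everything here is an assumption made for illustration and not the authors' implementation: the names `brain`, `action`, `run_episode`, and `Episode`, the dictionary lookup standing in for POI-grounded reasoning, the straight-line waypoint rule, and the 1-meter success radius do not come from the paper.

```python
import math
from dataclasses import dataclass

Point = tuple[float, float]


@dataclass
class Episode:
    start: Point
    entrance: Point               # ground-truth POI entrance (hypothetical annotation)
    success_radius: float = 1.0   # meters; assumed threshold, not taken from the paper


def brain(instruction: str, poi_index: dict[str, Point]) -> Point:
    """Stand-in for POI-grounded reasoning: find the POI named in the
    instruction and return its estimated entrance position."""
    for name, entrance in poi_index.items():
        if name.lower() in instruction.lower():
            return entrance
    raise KeyError("no known POI mentioned in the instruction")


def action(position: Point, goal: Point, max_step: float = 0.5) -> Point:
    """Stand-in for the Action module: return the next continuous waypoint,
    at most max_step meters from the current position, toward the goal."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return goal
    scale = max_step / dist
    return (position[0] + dx * scale, position[1] + dy * scale)


def run_episode(ep: Episode, instruction: str,
                poi_index: dict[str, Point], max_steps: int = 200):
    """Closed loop: the agent re-plans from its current position at every step.
    Returns (success, final distance to the ground-truth entrance)."""
    position = ep.start
    goal = brain(instruction, poi_index)
    for _ in range(max_steps):
        if position == goal:
            break
        position = action(position, goal)
    error = math.dist(position, ep.entrance)
    return error <= ep.success_radius, error
```

In this sketch, an episode counts as a success only if the agent stops within `success_radius` of the annotated entrance, which is one simple way to score final-meters arrival. A real benchmark would also need rendered observations and traversability checks, which are omitted here.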

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation

Point of Interest

final-meters navigation

sim-to-real gap

real-world navigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Navigation

Point of Interest (POI)

3D Gaussian Splatting