AI Summary
Current end-to-end autonomous driving systems struggle to effectively integrate global navigation information due to their heavy reliance on local perception, leading to inadequate route-following performance in complex scenarios. This work proposes the Sequential Navigation Guidance (SNG) framework, which systematically reveals, for the first time, the critical role of navigation understanding in end-to-end driving. To support this, the authors introduce the SNG-QA dataset, constructed from real-world navigation trajectories paired with turn-by-turn instructions. Building upon this foundation, they design the multimodal SNG-VLA model, which aligns global navigation intent with local trajectory planning without requiring auxiliary perception-based losses. Experimental results demonstrate that SNG-VLA achieves state-of-the-art performance in both navigation comprehension and trajectory planning, significantly improving navigation accuracy in challenging environments.
Abstract
Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We construct the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model, SNG-VLA, that fuses local planning with global planning. SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks. Project page: SNG-VLA