LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation

📅 2026-04-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing Aerial Vision-and-Language Navigation methods rely heavily on landmark descriptions and neglect directional cues, which leads to shallow instruction understanding and high computational cost. This work proposes LookasideVLN, a paradigm that explicitly models the directional information in natural language instructions by combining an Egocentric Lookaside Graph (ELG), a Spatial Landmark Knowledge Base (SLKB), and a multimodal large language model (MLLM) navigation agent. The ELG dynamically encodes instruction-relevant landmarks and their directional relationships, the SLKB provides lightweight memory retrieval from prior navigation experiences, and the agent aligns user instructions, visual observations, and ELG information for path planning. Even with a single-level lookahead, the method significantly outperforms the state-of-the-art CityNavAgent, indicating that directional cues are both effective and efficient for Aerial VLN.

📝 Abstract
Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments. While recent advances have achieved progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions, overlooking directional cues, a key source of spatial context in human navigation. In this work, we propose LookasideVLN, a new paradigm that exploits directional cues in natural language to achieve both more accurate spatial reasoning and greater computational efficiency. LookasideVLN comprises three core components: (1) an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relationships, (2) a Spatial Landmark Knowledge Base (SLKB) that provides lightweight memory retrieval from prior navigation experiences, and (3) a Lookaside MLLM Navigation Agent that aligns multimodal information from user instructions, visual observations, and landmark-direction information from ELG for path planning. Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with a single-level lookahead, demonstrating that leveraging directional cues is a powerful yet efficient strategy for Aerial VLN.
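
The abstract does not specify how the ELG is built, so the following is only a minimal sketch of the general idea of attaching an egocentric direction to each instruction-relevant landmark. Everything in it is an assumption made for illustration and not the authors' implementation: the `Landmark` type, the function names, the flat 2D east/north coordinates, the compass-heading convention, and the four 90-degree sectors are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Landmark:
    """Hypothetical landmark record: a name and a 2D position in metres."""
    name: str
    x: float  # east
    y: float  # north


def egocentric_direction(agent_xy, heading_deg, landmark):
    """Classify a landmark as front, right, left, or behind the agent.

    heading_deg is a compass heading: 0 = north (+y), 90 = east (+x).
    The 90-degree sector boundaries are an illustrative assumption.
    """
    dx = landmark.x - agent_xy[0]
    dy = landmark.y - agent_xy[1]
    # Compass bearing from the agent to the landmark, in degrees.
    bearing = math.degrees(math.atan2(dx, dy))
    # Angle relative to the agent's heading, wrapped into [-180, 180).
    relative = (bearing - heading_deg + 180.0) % 360.0 - 180.0
    if -45.0 <= relative < 45.0:
        return "front"
    if 45.0 <= relative < 135.0:
        return "right"
    if -135.0 <= relative < -45.0:
        return "left"
    return "behind"


def build_direction_map(agent_xy, heading_deg, landmarks, mentioned):
    """Map each landmark named in the instruction to its egocentric direction."""
    return {
        lm.name: egocentric_direction(agent_xy, heading_deg, lm)
        for lm in landmarks
        if lm.name in mentioned
    }
```

Under these assumptions, a UAV at the origin facing north with a tower due east of it would get `{"tower": "right"}` from `build_direction_map`, which is the kind of landmark-direction pairing an instruction such as "turn toward the tower on your right" refers to.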
Problem

Research questions and friction points this paper is trying to address.

Aerial Vision-and-Language Navigation
directional cues
spatial reasoning
instruction understanding
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

directional cues
Egocentric Lookaside Graph
Spatial Landmark Knowledge Base
Aerial Vision-and-Language Navigation
multimodal navigation agent
🔎 Similar Papers