GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

📅 2026-05-21

🤖 AI Summary
This work addresses two limitations of existing vision-and-language navigation methods: they rely on dense RGB video inputs, which are highly redundant, and they lack explicit spatial structure. The result is high computational cost and limited spatial reasoning. The authors propose a Geometry-Aware Bird's-Eye-View (GA-BEV) representation that projects RGB-D inputs into 3D space and aggregates them into a compact, agent-centric BEV layout. The approach combines explicit geometric projection with implicit structural priors taken from a pretrained 3D foundation model. The method reports state-of-the-art performance using only navigation data, without DAgger augmentation or mixed VQA training, which the authors take as evidence of its robustness and data efficiency.
📝 Abstract
Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.
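The explicit projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the camera model, grid size, cell resolution, or pooling rule, so the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the `grid_size` and `cell_m` parameters, the mean pooling, and the function names `backproject` and `build_bev` are all assumptions made for illustration. The sketch also ignores point height and omits the 3D foundation-model features. Plain Python lists are used to keep it dependency-free; a real system would operate on tensors.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row = forward bin, col = lateral bin)


def backproject(u: int, v: int, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) to camera coordinates.

    Assumed convention: x points right, y points down, z points forward.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth


def build_bev(depth_map: List[List[float]],
              feat_map: List[List[List[float]]],
              fx: float, fy: float, cx: float, cy: float,
              grid_size: int = 8, cell_m: float = 0.5) -> Dict[Cell, List[float]]:
    """Mean-pool per-pixel features into an agent-centric BEV grid.

    The agent sits at the middle of the near edge of the grid: rows index
    forward distance z, columns index lateral offset x. Only occupied cells
    are returned, so the number of BEV tokens can be far smaller than the
    number of pixels. (Hypothetical layout, chosen for illustration.)
    """
    dim = len(feat_map[0][0])
    half_width = grid_size * cell_m / 2.0
    sums: Dict[Cell, List[float]] = {}
    counts: Dict[Cell, int] = {}

    for v, depth_row in enumerate(depth_map):
        for u, d in enumerate(depth_row):
            if d <= 0.0:  # treat non-positive depth as an invalid reading
                continue
            x, _, z = backproject(u, v, d, fx, fy, cx, cy)
            row = int(z // cell_m)
            col = int((x + half_width) // cell_m)
            if not (0 <= row < grid_size and 0 <= col < grid_size):
                continue  # point falls outside the local map
            cell = (row, col)
            acc = sums.setdefault(cell, [0.0] * dim)
            for i, value in enumerate(feat_map[v][u]):
                acc[i] += value
            counts[cell] = counts.get(cell, 0) + 1

    # Average the accumulated features in each occupied cell.
    return {cell: [s / counts[cell] for s in acc] for cell, acc in sums.items()}


if __name__ == "__main__":
    depth = [[1.0, 1.0], [0.0, 3.0]]  # one invalid pixel (depth 0)
    feats = [[[1.0, 0.0], [3.0, 0.0]], [[9.0, 9.0], [0.0, 2.0]]]
    bev = build_bev(depth, feats, fx=2.0, fy=2.0, cx=0.0, cy=0.0,
                    grid_size=4, cell_m=1.0)
    print(bev)
```

In this toy input, the two valid pixels at depth 1 fall into the same cell and are averaged, while the invalid pixel is dropped, which is the sense in which a BEV layout of this kind reduces token redundancy relative to per-pixel or per-patch features.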
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
BEV representation
spatial reasoning
computational overhead
geometric understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-Aware BEV
Vision-Language Navigation
3D Foundation Model
Bird's-Eye-View Representation
Multimodal Large Language Model