OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical limitations in current vision-language navigation systems, which suffer from fragmented spatial understanding and delayed target discovery due to narrow fields of view, as well as performance degradation in large language models under long-context conditions. To overcome these challenges, the authors propose a novel framework integrating panoramic 3D perception with a hierarchical, efficient prompting mechanism. This approach leverages hardware-agnostic panoramic mapping, persistent homology–driven topological room segmentation, and hybrid geometric–vision-language relation verification to construct an agent-centric octant-based dynamic scene graph. A multi-resolution spatial attention prompting strategy is further introduced to guide stepwise reasoning in large language models. Experiments demonstrate that the method achieves a spatial reference accuracy of 93.18%, reduces prompt token consumption by 61.7%, and improves navigation success rates by up to 11.68%, substantially outperforming existing baselines.
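The agent-centric octant representation described above can be illustrated with a minimal sketch. This is not the authors' implementation; the helper names (`octant_index`, `build_octant_view`) and the sign-based partitioning scheme are assumptions made for illustration, grouping objects into the eight octants around the agent by the sign of each relative coordinate:

```python
import math

def octant_index(agent_xyz, obj_xyz):
    """Assign an object to one of 8 agent-centric octants by the sign of
    each relative coordinate (x: front/back, y: left/right, z: up/down).
    Encoding scheme is illustrative, not the paper's exact convention."""
    dx, dy, dz = (o - a for o, a in zip(obj_xyz, agent_xyz))
    return (dx >= 0) << 2 | (dy >= 0) << 1 | (dz >= 0)

def build_octant_view(agent_xyz, objects):
    """Group labeled objects into octants, sorted nearest-first within
    each octant, as a compact spatial summary for prompting."""
    view = {i: [] for i in range(8)}
    for label, xyz in objects:
        dist = math.dist(agent_xyz, xyz)
        view[octant_index(agent_xyz, xyz)].append((dist, label))
    return {i: [lbl for _, lbl in sorted(items)] for i, items in view.items()}
```

Such a view keeps the prompt compact: instead of raw 3D coordinates, the LLM sees a small, egocentrically organized object list per direction.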

📝 Abstract
Language-guided embodied navigation requires an agent to interpret object-referential instructions, search across multiple rooms, localize the referenced target, and execute reliable motion toward it. Existing systems remain limited in real indoor environments because narrow field-of-view sensing exposes only a partial local scene at each step, often forcing repeated rotations, delaying target discovery, and producing fragmented spatial understanding; meanwhile, directly prompting LLMs with dense 3D maps or exhaustive object lists quickly exceeds the context budget. We present OmniVLN, a zero-shot visual-language navigation framework that couples omnidirectional 3D perception with token-efficient hierarchical reasoning for both aerial and ground robots. OmniVLN fuses a rotating LiDAR and panoramic vision into a hardware-agnostic mapping stack, incrementally constructs a five-layer Dynamic Scene Graph (DSG) from mesh geometry to room- and building-level structure, and stabilizes high-level topology through persistent-homology-based room partitioning and hybrid geometric/VLM relation verification. For navigation, the global DSG is transformed into an agent-centric 3D octant representation with multi-resolution spatial attention prompting, enabling the LLM to progressively filter candidate rooms, infer egocentric orientation, localize target objects, and emit executable navigation primitives while preserving fine local detail and compact long-range memory. Experiments show that the proposed hierarchical interface improves spatial referring accuracy from 77.27% to 93.18%, reduces cumulative prompt tokens by up to 61.7% in cluttered multi-room settings, and improves navigation success by up to 11.68% over a flat-list baseline. We will release the code and an omnidirectional multimodal dataset to support reproducible research.
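The abstract's coarse-to-fine prompting idea, in which the LLM first filters candidate rooms and only then reasons over that room's objects, can be sketched as follows. This is a hedged illustration, not the paper's code: `hierarchical_prompt` and `ask_llm` are hypothetical names, and the real system uses a richer multi-resolution octant interface rather than two flat queries.

```python
def hierarchical_prompt(scene_graph, instruction, ask_llm):
    """Coarse-to-fine prompting: first select a candidate room from room
    names alone, then expose only that room's objects, instead of sending
    a flat list of every object in the building (which inflates tokens)."""
    rooms = list(scene_graph)  # e.g. {"kitchen": ["sink", ...], ...}
    room = ask_llm(
        f"Instruction: {instruction}\nRooms: {rooms}\nPick one room."
    )
    objects = scene_graph.get(room, [])
    target = ask_llm(
        f"Instruction: {instruction}\nObjects in {room}: {objects}\n"
        "Pick the target object."
    )
    return room, target
```

The token saving comes from never serializing objects in unselected rooms, which is where the reported reduction of up to 61.7% in cluttered multi-room scenes plausibly originates.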
Problem

Research questions and friction points this paper is trying to address.

visual-language navigation
narrow field-of-view
context budget
spatial understanding
embodied navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omnidirectional 3D Perception
Dynamic Scene Graph
Token-Efficient LLM Reasoning
Hierarchical Spatial Representation
Visual-Language Navigation