TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large vision-language models (VLMs) in vision-and-language navigation, where an architectural mismatch with dynamic embodied environments often leads to poor understanding of topological relationships and weak global action reasoning. To overcome these challenges, the authors propose TagaVLM, an end-to-end framework that integrates topological edge information directly into the VLM's self-attention mechanism. Specifically, they introduce Spatial Topology-Aware Residual Attention (STAR-Att) together with interleaved navigation prompts to explicitly enhance spatial reasoning and strengthen node-level alignment between visual and textual representations. Evaluated on the R2R benchmark in unseen environments, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a success rate (SR) of 51.09% and success weighted by path length (SPL) of 47.18, demonstrating that targeted structural enhancements can outperform mere model scaling.

📝 Abstract
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication. Project page: https://apex-bjut.github.io/Taga-VLM
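The abstract describes STAR-Att as injecting topological edge information directly into the VLM's self-attention. The paper's exact formulation is not reproduced here; a common way to realize such a mechanism is to add a graph-derived residual bias to the attention logits before the softmax, so edges in the navigation graph modulate which nodes attend to each other. The sketch below illustrates that general pattern only (the function name, bias construction, and shapes are illustrative assumptions, not TagaVLM's implementation):

```python
import numpy as np

def topology_biased_attention(q, k, v, edge_bias):
    """Single-head attention over graph nodes with an additive topology bias.

    q, k, v: (nodes, dim) node features; edge_bias: (nodes, nodes),
    e.g. 0 where an edge exists and a large negative value otherwise,
    added residually to the attention logits (illustrative, not the
    paper's exact STAR-Att formulation).
    """
    logits = q @ k.T / np.sqrt(q.shape[-1]) + edge_bias
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

# Toy usage: 4 graph nodes; a self-loop-only adjacency as a stand-in graph.
rng = np.random.default_rng(0)
N, D = 4, 8
q, k, v = (rng.standard_normal((N, D)) for _ in range(3))
adj = np.eye(N)                                # hypothetical adjacency
edge_bias = np.where(adj > 0, 0.0, -1e4)       # suppress non-edges
out = topology_biased_attention(q, k, v, edge_bias)
# With self-loops only, each node attends almost entirely to itself,
# so the output is approximately v.
```

Because the bias enters additively, a zero bias recovers the pretrained attention exactly, which is one plausible reading of the abstract's claim that the mechanism preserves pretrained knowledge.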
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
Large Vision-Language Models
Topological Reasoning
Embodied AI
Spatial Structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Topology-Aware
Global Action Reasoning
Vision-Language Navigation
Spatial Topology
STAR-Att