🤖 AI Summary
Vision-and-language navigation (VLN) suffers from high computational overhead in large language model (LLM) inference, while existing token pruning methods neglect the adverse impact of increased path length on practical efficiency. Method: We propose Navigation-Aware Pruning (NAP), a fine-tuning-free approach that jointly prunes multimodal tokens using two criteria—navigational feasibility (based on viewpoint traversability) and instruction relevance (extracted via LLMs)—to preserve foreground (navigation-critical) regions, selectively prune background tokens, and suppress spurious backtracking nodes. NAP thus balances semantic fidelity with path optimality. Contribution/Results: Evaluated on standard VLN benchmarks, NAP improves task success rate while reducing FLOPs by over 50%, significantly outperforming prior pruning methods in both efficiency and effectiveness.
📝 Abstract
Large models achieve strong performance on Vision-and-Language Navigation (VLN) tasks, but are costly to run in resource-limited environments. Token pruning offers an appealing efficiency tradeoff with minimal performance loss by reducing model input size, but prior work overlooks VLN-specific challenges. For example, information loss from pruning can effectively increase computational cost, because a poorly informed agent takes longer paths to reach its goal. The inability to identify uninformative tokens thus undermines the supposed efficiency gains from pruning. To address this, we propose Navigation-Aware Pruning (NAP), which uses navigation-specific traits to simplify the pruning process by pre-filtering tokens into foreground and background. For example, image views are filtered based on whether the agent can navigate in that direction, and navigation-relevant instructions are extracted using a Large Language Model. After filtering, we focus pruning on background tokens, minimizing information loss. To further avoid increases in navigation length, we discourage backtracking by removing low-importance navigation nodes. Experiments on standard VLN benchmarks show that NAP significantly outperforms prior work, preserving higher success rates while saving more than 50% of FLOPs.
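The core pruning step described above—keep all foreground tokens, then let only background tokens compete for the remaining token budget—can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the function name, the token representation, and the use of a single scalar importance score are all assumptions made for the example.

```python
# Hypothetical sketch of navigation-aware token pruning (not the paper's code).
# Foreground tokens (navigable views, instruction-relevant words) are always
# kept; background tokens are ranked by an importance score (e.g., attention
# weight) and only the highest-scoring ones fill the remaining budget.

def nap_prune(tokens, keep_ratio):
    """tokens: list of dicts with keys 'id', 'foreground' (bool), 'score'.
    Returns surviving tokens in their original order."""
    budget = max(1, int(len(tokens) * keep_ratio))
    foreground = [t for t in tokens if t["foreground"]]
    background = [t for t in tokens if not t["foreground"]]
    # Foreground tokens are exempt from pruning; background tokens compete
    # for whatever budget remains after the foreground is kept.
    slots = max(0, budget - len(foreground))
    kept_bg = sorted(background, key=lambda t: t["score"], reverse=True)[:slots]
    kept_ids = {t["id"] for t in foreground} | {t["id"] for t in kept_bg}
    return [t for t in tokens if t["id"] in kept_ids]
```

Because the navigability and instruction-relevance filters decide the foreground set up front, the importance ranking only has to discriminate among background tokens, which is what limits the information loss from pruning.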