VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two obstacles to real-time deployment of vision-and-language navigation (VLN) models: the high computational overhead of large vision-language backbones, and the failure of conventional token caching in dynamic environments, where viewpoint shifts and an evolving semantic focus invalidate cached tokens. The paper introduces the first training-free caching framework that jointly models visual dynamics (capturing viewpoint changes) and semantic dynamics (reflecting task-phase progression). The approach leverages view-aligned remapping to recover geometric correspondences, employs a task-relevance saliency filter to detect semantic transitions, and incorporates a hierarchical adaptive entropy strategy to govern cache reuse. Evaluated on the R2R-CE benchmark, the method achieves up to a 1.52× inference speedup while maintaining competitive navigation success rates.

📝 Abstract
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shift displaces token positions across frames, causing position-wise matching to pair misaligned content; (2) semantic dynamics, where token relevance shifts across task stages as navigation progresses, making cached states stale. We propose VLN-Cache, a visual-dynamic-aware and semantic-dynamic-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to 1.52x speedup while maintaining competitive navigation success rates.
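The gating logic the abstract describes — reuse a cached token only when it geometrically corresponds to the current token and is not semantically in flux — can be sketched in a few lines. This is a minimal, illustrative sketch, not the paper's implementation: the index-array `remap` stands in for view-aligned remapping, the saliency/entropy thresholds and the cosine-similarity test are all assumptions chosen for clarity.

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a (normalized) non-negative score vector.
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def reuse_tokens(cached, current, remap, saliency,
                 ent_threshold=2.0, sim_threshold=0.95):
    """Per-token reuse decision (illustrative only).

    cached, current: (N, D) token features from previous / current frame.
    remap:    index array aligning current token i to cached token remap[i]
              (a stand-in for the paper's view-aligned remapping).
    saliency: (N,) task-relevance scores; a concentrated (low-entropy)
              distribution signals a semantic transition and vetoes reuse.
    """
    aligned = cached[remap]  # recover geometric correspondence
    # Cosine similarity between aligned cached tokens and current tokens.
    sim = (aligned * current).sum(-1) / (
        np.linalg.norm(aligned, axis=-1) * np.linalg.norm(current, axis=-1) + 1e-12
    )
    # Entropy gate: concentrated saliency -> semantic shift -> recompute all.
    if entropy(saliency) < ent_threshold:
        return np.zeros(len(current), dtype=bool)
    # Reuse tokens that are both visually stable and of low task relevance.
    return (sim > sim_threshold) & (saliency <= np.median(saliency))
```

In the paper the reuse budget is additionally adapted per layer; the sketch collapses that into a single entropy threshold for brevity.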
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
Token Caching
Visual Dynamics
Semantic Dynamics
Real-time Deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Caching
Visual Dynamics
Semantic Dynamics
View-Aligned Remapping
Task-Relevance Saliency
Zihao Zheng
Peking University
Machine Learning System, Edge Computing, Computer Architecture, EDA
Zhihao Mao
School of Computer Science, China University of Geosciences (Wuhan), Wuhan, China
Xingyue Zhou
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China
Jiayu Chen
PhD student, IFLab@PKU
Efficient Visual Generation, ML System
Maoliang Li
School of Computer Science, Peking University, Beijing, China
Xinhao Sun
School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Hailong Zou
School of Computer Science, Peking University, Beijing, China
Zhaobo Zhang
School of Computer Science, Peking University, Beijing, China
Xuanzhe Liu
Boya Distinguished Professor, Peking University, ACM Distinguished Scientist
Machine Learning System, Mobile Computing System, Serverless Computing
Donggang Cao
School of Computer Science, Peking University, Beijing, China
Hong Mei
Peking University
Software Engineering, System Software, Data Analytics
Xiang Chen
School of Computer Science, Peking University, Beijing, China