FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

📅 2026-04-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limitations of existing token caching methods in vision-and-language navigation (VLN), which struggle to balance efficiency and performance because of viewpoint variations, loss of critical edge information, and rigid cache budgets. To overcome these challenges, it introduces frequency-domain analysis into VLN token caching for the first time, leveraging its viewpoint invariance and structural interpretability. The proposed training-free, adaptive caching framework dynamically optimizes cache construction, refreshment, and budget allocation. The approach preserves essential visual edge features and mitigates the effects of viewpoint shift, achieving a 1.59× inference speedup with negligible computational overhead.

📝 Abstract
Vision-Language-Navigation (VLN) models achieve high navigation accuracy but incur high computational overhead. Token caching has emerged as a promising training-free strategy that reduces this cost by reusing previously computed token results. However, existing token caching approaches select cacheable tokens with visual-domain methods, which causes three problems when they are applied to VLN models: 1) visual-domain methods break down under viewpoint migration; 2) they neglect critical edge information unless supplemented by additional algorithms; 3) they overlook how scenes vary over time and cannot adjust the cache budget. In this paper, we analyze these challenges in detail and find that their effects are invariant and analyzable in the frequency domain. Based on these findings, we propose FreqCache, a frequency-guided token caching framework. Exploiting the inherent properties of the frequency domain, FreqCache optimizes token cache establishment, refreshment, and adaptive adjustment. Experiments show that FreqCache achieves a 1.59× speedup with negligible overhead, demonstrating the benefit of integrating frequency-domain methods into VLN token caching.
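
The abstract does not spell out how FreqCache scores tokens, so the following is only a minimal sketch of the general idea, not the authors' method. It relies on one known property: the magnitude of the discrete Fourier transform does not change under a circular shift, whereas a pixel-wise comparison does. The function names (`spectral_distance`, `non_dc_energy_ratio`, `should_reuse`), the two thresholds, and the rule "recompute edge-heavy patches" are all hypothetical choices made for illustration. Real viewpoint changes are not exact circular shifts, so the invariance shown here is an idealization.

```python
import cmath
import math


def dft2_magnitude(patch):
    """Magnitude of the 2D DFT of a square patch given as a list of rows."""
    n = len(patch)
    mags = []
    for u in range(n):
        row = []
        for v in range(n):
            acc = 0j
            for y in range(n):
                for x in range(n):
                    angle = -2 * math.pi * (u * y + v * x) / n
                    acc += patch[y][x] * cmath.exp(1j * angle)
            row.append(abs(acc))
        mags.append(row)
    return mags


def spectral_distance(prev_patch, cur_patch):
    """Relative L1 distance between the magnitude spectra of two patches."""
    ma, mb = dft2_magnitude(prev_patch), dft2_magnitude(cur_patch)
    diff = sum(abs(a - b) for ra, rb in zip(ma, mb) for a, b in zip(ra, rb))
    norm = sum(a for ra in ma for a in ra) + 1e-12
    return diff / norm


def non_dc_energy_ratio(patch):
    """Share of spectral energy outside the DC term (hypothetical edge score)."""
    mags = dft2_magnitude(patch)
    total = sum(m * m for row in mags for m in row) + 1e-12
    dc = mags[0][0] ** 2
    return (total - dc) / total


def should_reuse(prev_patch, cur_patch, max_distance=0.05, max_edge_ratio=0.3):
    """Assumed rule: reuse a cached token only if the spectrum is stable
    and the patch is not dominated by edge (non-DC) energy."""
    stable = spectral_distance(prev_patch, cur_patch) <= max_distance
    smooth = non_dc_energy_ratio(cur_patch) <= max_edge_ratio
    return stable and smooth


# A mild texture and the same texture circularly shifted by one column.
mild = [[2, 2, 3, 3] for _ in range(4)]
mild_shifted = [[3, 2, 2, 3] for _ in range(4)]
# A strong edge: half of the spectral energy lies outside the DC term.
strong_edge = [[0, 0, 1, 1] for _ in range(4)]

print(should_reuse(mild, mild_shifted))        # True
print(should_reuse(strong_edge, strong_edge))  # False
```

In this toy setup the shifted patch differs from the original at half of its pixels, yet its magnitude spectrum is unchanged, so the cached token would be reused. The strong-edge patch is always recomputed, which is one simple way to avoid serving stale results for edge-heavy tokens.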

Problem

Research questions and friction points this paper is trying to address.

token caching
Vision-Language Navigation
frequency domain
computational overhead
viewpoint migration

Innovation

Methods, ideas, or system contributions that make the work stand out.

frequency domain
token caching
embodied navigation
adaptive caching
Vision-Language Navigation
Zihao Zheng
Peking University
Machine Learning System, Edge Computing, Computer Architecture, EDA
Xingyue Zhou
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Zhihao Mao
School of Computer Science, China University of Geosciences (Wuhan)
Songyu Sun
College of Computer Science and Electronic Engineering, Hunan University
Lingyue Zhang
School of EECS, Peking University
Yulong Ao
Beijing Academy of Artificial Intelligence, BAAI
Yupu Feng
Beijing Academy of Artificial Intelligence, BAAI
Qiongqiong Zhang
Beijing Academy of Artificial Intelligence, BAAI
Yonghua Lin
Beijing Academy of Artificial Intelligence, BAAI
Xiang Chen
School of Computer Science, Peking University