TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
🤖 AI Summary
This work addresses the challenge of non-Markovian dynamics, such as occlusions and state ambiguities, in long-horizon robotic manipulation, where pretrained vision-language-action (VLA) policies often fail because their inference is memoryless. The authors propose TempoFit, a training-free, plug-and-play temporal retrofit that reuses cached prefix attention keys and values (K/V) from intermediate layers of a frozen VLA model as content-addressable state memory. By combining parameter-free K-to-K retrieval, a fixed Frame-Gap Temporal Bias (FGTB), and norm-preserving residual injection, the method models historical context without adding new tokens or trainable modules. On LIBERO-LONG, TempoFit improves average success rates by up to 4.0% while maintaining near-real-time inference, and it transfers to both CALVIN and real-world robotic tasks.
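
How this could look in code: a minimal sketch, assuming a PyTorch backbone, of the layer-wise FIFO K/V store, the parameter-free K-to-K retrieval, and the fixed recency bias. The names (PrefixKVMemory, push, retrieve, fgtb_slope) and the exact form of the bias are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of TempoFit-style prefix K/V memory; names are assumptions.
from collections import deque

import torch
import torch.nn.functional as F


class PrefixKVMemory:
    """Layer-wise FIFO store of prefix attention keys/values from past frames."""

    def __init__(self, capacity: int = 8, fgtb_slope: float = 0.5):
        self.frames = deque(maxlen=capacity)  # each entry: (K, V) of shape (n_tokens, d)
        self.fgtb_slope = fgtb_slope          # fixed recency penalty per frame gap (assumed form)

    def push(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Detach so the frozen backbone's graph is never extended.
        self.frames.append((k.detach(), v.detach()))

    def retrieve(self, k_query: torch.Tensor) -> torch.Tensor:
        """Parameter-free K-to-K retrieval with a fixed Frame-Gap Temporal Bias.

        k_query: current frame's prefix keys, shape (n_tokens, d).
        Returns a context tensor of the same shape, aggregated from memory.
        """
        if not self.frames:
            return torch.zeros_like(k_query)
        contexts, frame_scores = [], []
        for gap, (k_past, v_past) in enumerate(reversed(self.frames), start=1):
            # Similarity between current and past keys (K-to-K matching).
            sim = k_query @ k_past.T / k_query.shape[-1] ** 0.5   # (n, n_past)
            attn = F.softmax(sim, dim=-1)
            contexts.append(attn @ v_past)                        # (n, d)
            # FGTB: linear recency bias, older frames are penalized more.
            frame_scores.append(sim.max(dim=-1).values.mean() - self.fgtb_slope * gap)
        # Present-dominant weighting across frames via the biased scores.
        weights = F.softmax(torch.stack(frame_scores), dim=0)     # (n_frames,)
        return sum(w * c for w, c in zip(weights, contexts))
```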

📝 Abstract
Pretrained Vision-Language-Action (VLA) policies have achieved strong single-step manipulation, but their inference remains largely memoryless, which is brittle in non-Markovian long-horizon settings with occlusion, state aliasing, and subtle post-action changes. Prior approaches inject history either by stacking frames, which scales visual tokens and latency while adding near-duplicate pixels, or by learning additional temporal interfaces that require (re-)training and may break the original single-frame inference graph. We present TempoFit, a training-free temporal retrofit that upgrades frozen VLAs through state-level memory. Our key insight is that prefix attention K/V already form a model-native, content-addressable runtime state; reusing them across timesteps introduces history without new tokens or trainable modules. TempoFit stores layer-wise FIFO prefix K/V at selected intermediate layers, performs parameter-free K-to-K retrieval with Frame-Gap Temporal Bias (FGTB), a fixed recency bias inspired by positional biases in NLP, to keep decisions present-dominant, and injects the retrieved context via pre-attention residual loading with norm-preserving rescaling to avoid distribution shift under frozen weights. On LIBERO-LONG, TempoFit improves strong pretrained backbones by up to +4.0% average success rate while maintaining near-real-time latency, and it transfers consistently to CALVIN and real-robot long-horizon tasks.
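
The injection step admits an equally small sketch: add the retrieved context to the pre-attention hidden states, then rescale each token back to its original norm so the frozen weights see activations in their training-time range. The function signature, the alpha mixing weight, and the L2 rescaling are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of norm-preserving pre-attention residual injection.
import torch


def inject_context(hidden: torch.Tensor,
                   context: torch.Tensor,
                   alpha: float = 0.3,
                   eps: float = 1e-6) -> torch.Tensor:
    """Pre-attention residual loading with norm-preserving rescaling.

    hidden:  current-frame activations at a selected layer, shape (n_tokens, d).
    context: retrieved temporal context of the same shape.
    alpha:   mixing weight for the historical signal (assumed hyperparameter).
    """
    mixed = hidden + alpha * context
    # Rescale each token so its L2 norm matches the pre-injection norm,
    # avoiding distribution shift under frozen weights.
    orig_norm = hidden.norm(dim=-1, keepdim=True)
    mixed_norm = mixed.norm(dim=-1, keepdim=True).clamp_min(eps)
    return mixed * (orig_norm / mixed_norm)
```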
Problem

Research questions and friction points this paper is trying to address.

long-horizon manipulation
memoryless inference
non-Markovian environments
vision-language-action policies
temporal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Memory
Plug-and-Play
Prefix Attention
Frame-Gap Temporal Bias
Frozen VLA Adaptation
👥 Authors
Jun Sun (Xi’an Jiaotong-Liverpool University)
Boyu Yang (Xi’an Jiaotong-Liverpool University)
Jiahao Zhang (Xi’an Jiaotong-Liverpool University)
Ning Ma (University of Sheffield): acoustic analysis for health, sleep, speech and language technology
Chencheng Wu (Xi’an Jiaotong-Liverpool University)
Siqing Zhang (Xi’an Jiaotong-Liverpool University)
Yiou Huang (Xi’an Jiaotong-Liverpool University)
Qiufeng Wang (INT, SAT, XJTLU): document analysis and recognition, pattern recognition, machine learning
Shan Liang (Xi’an Jiaotong-Liverpool University)
Yaran Chen (Xi’an Jiaotong-Liverpool University)