TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
🤖 AI Summary
This work addresses the challenge of non-Markovian dynamics, such as occlusions and state ambiguities, in long-horizon robotic manipulation, where pretrained vision-language-action (VLA) policies often fail because their inference is memoryless. The authors propose TempoFit, a training-free, plug-and-play temporal retrofit that reuses cached prefix attention keys and values (K/V) from intermediate layers of a frozen VLA model as content-addressable state memory. By combining parameter-free K-to-K retrieval, a fixed Frame-Gap Temporal Bias (FGTB), and norm-preserving residual injection, the method models historical context without adding new tokens or trainable modules. On LIBERO-LONG, TempoFit improves average success rates by up to 4.0% while maintaining near-real-time inference, and it transfers to both CALVIN and real-world robotic tasks.
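
How this could look in code: a minimal sketch, assuming a PyTorch backbone, of the layer-wise FIFO K/V store, the parameter-free K-to-K retrieval, and the fixed recency bias. The names (PrefixKVMemory, push, retrieve, fgtb_slope) and the exact form of the bias are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of TempoFit-style prefix K/V memory; names are assumptions.
from collections import deque

import torch
import torch.nn.functional as F


class PrefixKVMemory:
    """Layer-wise FIFO store of prefix attention keys/values from past frames."""

    def __init__(self, capacity: int = 8, fgtb_slope: float = 0.5):
        self.frames = deque(maxlen=capacity)  # each entry: (K, V) of shape (n_tokens, d)
        self.fgtb_slope = fgtb_slope          # fixed recency penalty per frame gap (assumed form)

    def push(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Detach so the frozen backbone's graph is never extended.
        self.frames.append((k.detach(), v.detach()))

    def retrieve(self, k_query: torch.Tensor) -> torch.Tensor:
        """Parameter-free K-to-K retrieval with a fixed Frame-Gap Temporal Bias.

        k_query: current frame's prefix keys, shape (n_tokens, d).
        Returns a context tensor of the same shape, aggregated from memory.
        """
        if not self.frames:
            return torch.zeros_like(k_query)
        contexts, frame_scores = [], []
        for gap, (k_past, v_past) in enumerate(reversed(self.frames), start=1):
            # Similarity between current and past keys (K-to-K matching).
            sim = k_query @ k_past.T / k_query.shape[-1] ** 0.5   # (n, n_past)
            attn = F.softmax(sim, dim=-1)
            contexts.append(attn @ v_past)                        # (n, d)
            # FGTB: linear recency bias, older frames are penalized more.
            frame_scores.append(sim.max(dim=-1).values.mean() - self.fgtb_slope * gap)
        # Present-dominant weighting across frames via the biased scores.
        weights = F.softmax(torch.stack(frame_scores), dim=0)     # (n_frames,)
        return sum(w * c for w, c in zip(weights, contexts))
```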

📝 Abstract
Pretrained Vision-Language-Action (VLA) policies have achieved strong single-step manipulation, but their inference remains largely memoryless, which is brittle in non-Markovian long-horizon settings with occlusion, state aliasing, and subtle post-action changes. Prior approaches inject history either by stacking frames, which scales visual tokens and latency while adding near-duplicate pixels, or by learning additional temporal interfaces that require (re-)training and may break the original single-frame inference graph. We present TempoFit, a training-free temporal retrofit that upgrades frozen VLAs through state-level memory. Our key insight is that prefix attention K/V already form a model-native, content-addressable runtime state; reusing them across timesteps introduces history without new tokens or trainable modules. TempoFit stores layer-wise FIFO prefix K/V at selected intermediate layers, performs parameter-free K-to-K retrieval with Frame-Gap Temporal Bias (FGTB), a fixed recency bias inspired by positional biases in NLP, to keep decisions present-dominant, and injects the retrieved context via pre-attention residual loading with norm-preserving rescaling to avoid distribution shift under frozen weights. On LIBERO-LONG, TempoFit improves strong pretrained backbones by up to +4.0% average success rate while maintaining near-real-time latency, and it transfers consistently to CALVIN and real-robot long-horizon tasks.
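
The injection step admits an equally small sketch: add the retrieved context to the pre-attention hidden states, then rescale each token back to its original norm so the frozen weights see activations in their training-time range. The function signature, the alpha mixing weight, and the L2 rescaling are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of norm-preserving pre-attention residual injection.
import torch


def inject_context(hidden: torch.Tensor,
                   context: torch.Tensor,
                   alpha: float = 0.3,
                   eps: float = 1e-6) -> torch.Tensor:
    """Pre-attention residual loading with norm-preserving rescaling.

    hidden:  current-frame activations at a selected layer, shape (n_tokens, d).
    context: retrieved temporal context of the same shape.
    alpha:   mixing weight for the historical signal (assumed hyperparameter).
    """
    mixed = hidden + alpha * context
    # Rescale each token so its L2 norm matches the pre-injection norm,
    # avoiding distribution shift under frozen weights.
    orig_norm = hidden.norm(dim=-1, keepdim=True)
    mixed_norm = mixed.norm(dim=-1, keepdim=True).clamp_min(eps)
    return mixed * (orig_norm / mixed_norm)
```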
Problem

Research questions and friction points this paper is trying to address.

long-horizon manipulation
memoryless inference
non-Markovian environments
vision-language-action policies
temporal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Memory
Plug-and-Play
Prefix Attention
Frame-Gap Temporal Bias
Frozen VLA Adaptation
👥 Authors
Jun Sun (Xi’an Jiaotong-Liverpool University)
Boyu Yang (Xi’an Jiaotong-Liverpool University)
Jiahao Zhang (Xi’an Jiaotong-Liverpool University)
Ning Ma (University of Sheffield): acoustic analysis for health, sleep, speech and language technology
Chencheng Wu (Xi’an Jiaotong-Liverpool University)
Siqing Zhang (Xi’an Jiaotong-Liverpool University)
Yiou Huang (Xi’an Jiaotong-Liverpool University)
Qiufeng Wang (INT, SAT, XJTLU): document analysis and recognition, pattern recognition, machine learning
Shan Liang (Xi’an Jiaotong-Liverpool University)
Yaran Chen (Xi’an Jiaotong-Liverpool University)