🤖 AI Summary
In partially observable robotic tasks, multi-frame visual inputs incur high computational overhead, and prior behavior-cloning policies have not reliably benefited from them. Method: We propose ContextVLA, a vision-language-action model that pairs the temporal reasoning capability of vision-language models with a lightweight context compression mechanism. Specifically, a learnable context encoder aggregates past frames into a single context token, keeping action generation efficient at inference time. Contribution/Results: The approach retains the low computational cost of single-frame models while consistently improving policy performance, matching full multi-frame input models across multiple benchmarks and reducing training and inference time by over 50%. The results indicate that a compact historical representation suffices for complex partially observable decision-making, pointing toward more efficient embodied intelligence.
📝 Abstract
Leveraging temporal context is crucial for success in partially observable robotic tasks. However, prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. In this paper, we introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations. Our approach is motivated by the key observation that Vision-Language-Action (VLA) models, i.e., policy models built upon a Vision-Language Model (VLM), utilize multi-frame observations more effectively for action generation. This suggests that VLMs' inherent temporal understanding enables them to extract more meaningful context from multi-frame observations. However, the high dimensionality of video inputs introduces significant computational overhead, making VLA training and inference inefficient. To address this, ContextVLA compresses past observations into a single context token, allowing the policy to efficiently leverage temporal context for action generation. Our experiments show that ContextVLA consistently improves over single-frame VLAs and achieves the benefits of full multi-frame training with reduced training and inference times.
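The core idea, compressing many past-frame tokens into one context token, can be illustrated with a minimal sketch. This is an assumption about the design, not the authors' implementation: it uses a single learnable query that cross-attends over all past visual tokens, a common way to pool a variable-length history into a fixed-size representation. All class and variable names (`ContextCompressor`, `past_tokens`, token counts, feature dimension) are hypothetical.

```python
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    """Hypothetical sketch of history compression for a VLA policy:
    a single learnable query cross-attends over all past-frame visual
    tokens, producing one context token regardless of history length."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable query vector that will become the context token.
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, past_tokens: torch.Tensor) -> torch.Tensor:
        # past_tokens: (B, T*N, dim) -- visual tokens from T past frames,
        # N tokens per frame, flattened along the sequence axis.
        batch = past_tokens.shape[0]
        q = self.query.expand(batch, -1, -1)          # (B, 1, dim)
        ctx, _ = self.attn(q, past_tokens, past_tokens)
        return self.norm(ctx)                         # (B, 1, dim): one context token

# Usage: 4 past frames x 16 tokens each, 256-dim features, batch of 2.
compressor = ContextCompressor(dim=256)
past = torch.randn(2, 4 * 16, 256)
context_token = compressor(past)
print(context_token.shape)  # torch.Size([2, 1, 256])
```

The single context token can then be prepended to the current-frame tokens fed to the VLM backbone, so the policy's per-step cost stays close to that of a single-frame model.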