🤖 AI Summary
In partially observable robotic tasks, multi-frame visual inputs incur high computational overhead, and prior behavior-cloning policies have not reliably benefited from them. Method: We propose ContextVLA, a vision-language-action model that pairs the temporal reasoning capability of vision-language models with a lightweight context compression mechanism. Specifically, a learnable context encoder aggregates past frames into a single context token, keeping action generation efficient at inference time. Contribution/Results: The approach retains the low computational cost of single-frame models while consistently improving policy performance, matching full multi-frame input models across multiple benchmarks and reducing training and inference time by over 50%. The results indicate that a compact historical representation suffices for complex partially observable decision-making, pointing toward more efficient embodied intelligence.
📝 Abstract
Leveraging temporal context is crucial for success in partially observable robotic tasks. However, prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. In this paper, we introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations. Our approach is motivated by the key observation that Vision-Language-Action (VLA) models, i.e., policy models built upon a Vision-Language Model (VLM), utilize multi-frame observations more effectively for action generation. This suggests that VLMs' inherent temporal understanding enables them to extract more meaningful context from multi-frame observations. However, the high dimensionality of video inputs introduces significant computational overhead, making VLA training and inference inefficient. To address this, ContextVLA compresses past observations into a single context token, allowing the policy to efficiently leverage temporal context for action generation. Our experiments show that ContextVLA consistently improves over single-frame VLAs and achieves the benefits of full multi-frame training with reduced training and inference times.
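The core idea, compressing many past-frame tokens into one context token, can be illustrated with a minimal sketch. This is an assumption about the design, not the authors' implementation: it uses a single learnable query that cross-attends over all past visual tokens, a common way to pool a variable-length history into a fixed-size representation. All class and variable names (`ContextCompressor`, `past_tokens`, token counts, feature dimension) are hypothetical.

```python
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    """Hypothetical sketch of history compression for a VLA policy:
    a single learnable query cross-attends over all past-frame visual
    tokens, producing one context token regardless of history length."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable query vector that will become the context token.
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, past_tokens: torch.Tensor) -> torch.Tensor:
        # past_tokens: (B, T*N, dim) -- visual tokens from T past frames,
        # N tokens per frame, flattened along the sequence axis.
        batch = past_tokens.shape[0]
        q = self.query.expand(batch, -1, -1)          # (B, 1, dim)
        ctx, _ = self.attn(q, past_tokens, past_tokens)
        return self.norm(ctx)                         # (B, 1, dim): one context token

# Usage: 4 past frames x 16 tokens each, 256-dim features, batch of 2.
compressor = ContextCompressor(dim=256)
past = torch.randn(2, 4 * 16, 256)
context_token = compressor(past)
print(context_token.shape)  # torch.Size([2, 1, 256])
```

The single context token can then be prepended to the current-frame tokens fed to the VLM backbone, so the policy's per-step cost stays close to that of a single-frame model.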