OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiencies of deploying multi-task Vision-Language-Action (VLA) models on edge devices, where conventional isolated KV cache management leads to redundant computation and resource contention. The authors propose a unified KV cache management paradigm that, for the first time, treats the KV cache as a first-class shared resource across tasks and timesteps. By enabling cross-task prefix sharing, decoupling heterogeneous outputs, and applying continuous batching across video frames, the approach significantly improves inference efficiency. Integrated with a Mixture-of-Transformers architecture, the method achieves up to 3.7× speedup on the π₀.₅ model, delivering over 200 tokens/s language throughput and 70 Hz action generation frequency without compromising action quality.
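The cross-task prefix sharing described above can be illustrated with a minimal sketch. All names here (`SharedKVCache`, `run_tasks`, the task labels) are hypothetical, not from the paper: the idea is simply that the shared observation tokens are prefilled once, and every task reuses the resulting KV blocks instead of recomputing them.

```python
# Hypothetical sketch of cross-task KV prefix sharing; names and
# structure are illustrative, not taken from the OxyGen implementation.

class SharedKVCache:
    """Maps an observation id to its prefilled KV blocks."""

    def __init__(self):
        self._blocks = {}  # obs_id -> KV blocks (placeholder objects here)

    def get_or_prefill(self, obs_id, prefill_fn):
        # Prefill the shared observation tokens only on first use;
        # subsequent tasks reuse the cached blocks.
        if obs_id not in self._blocks:
            self._blocks[obs_id] = prefill_fn(obs_id)
        return self._blocks[obs_id]


def run_tasks(cache, obs_id, tasks, prefill_fn):
    # Each task (e.g. action generation, dialogue, memory construction)
    # extends the SAME shared prefix with its own task-specific suffix,
    # so the observation is prefilled exactly once per frame.
    shared = cache.get_or_prefill(obs_id, prefill_fn)
    return {task: (shared, f"{task}-suffix") for task in tasks}
```

With N parallel tasks, the shared observation prefill happens once rather than N times, which is where the redundant computation of isolated execution disappears.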

📝 Abstract
Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for π₀.₅, the most popular MoT VLA, and evaluate under representative robotic configurations. OxyGen achieves up to 3.7× speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without action quality degradation.
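The cross-frame continuous batching idea in the abstract can also be sketched. The following toy scheduler is a hypothetical illustration (the function and labels are not from the paper): a fixed-rate action request is admitted every control cycle, while a long-running language decode keeps its batch slot across frames until it finishes, so neither stream blocks the other.

```python
# Hypothetical sketch of cross-frame continuous batching; this toy
# scheduler is illustrative only, not the OxyGen implementation.

def schedule(num_frames, lang_steps):
    """Simulate batching over `num_frames` control cycles.

    Each frame admits one fixed-rate action request (finishing within
    its cycle), while a single language request needing `lang_steps`
    decode steps persists across frames until it emits its last token.
    Returns the per-frame batch composition.
    """
    lang_remaining = lang_steps
    timeline = []
    for _ in range(num_frames):
        batch = ["action"]            # fixed-rate: one action job per frame
        if lang_remaining > 0:
            batch.append("language")  # continue the in-flight decode
            lang_remaining -= 1
        timeline.append(batch)
    return timeline
```

Because the language decode rides along with whatever action work each frame already requires, variable-length text generation never stalls the fixed action cadence, mirroring the decoupling the abstract describes.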
Problem

Research questions and friction points this paper is trying to address.

KV cache management
multi-task parallelism
Vision-Language-Action Models
on-device deployment
resource contention
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified KV cache management
multi-task parallelism
Vision-Language-Action Models
cross-task KV sharing
continuous batching