One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing world-model-enhanced vision-language-action (VLA) approaches struggle to couple visual representations with actions on a frozen backbone and use visual bandwidth inefficiently. This work proposes an architecture that represents each video frame with a single semantic token: visual inputs are compressed via adaptive attention pooling, and a unified flow-matching objective jointly models latent state sequences and action trajectories, enabling long-horizon planning at very low visual bandwidth. The method requires no high-bandwidth visual input and integrates a lightweight world module into a frozen VLA backbone with LoRA fine-tuning. Experiments report gains on three benchmarks: success rate on MetaWorld MT50 improves from 47.9% to 61.3%, LIBERO-Long reaches 95.6% (up from 85.2%), and real-world cloth-folding success on a Piper robot arm rises from 20.0% to 60.0%.
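To make the "one token per frame" idea concrete, here is a minimal sketch of attention pooling with a single query vector. It is not the paper's Adaptive Attention Pooling, whose details are not given here; the function name `attention_pool`, the single-query design, the projection matrices `w_k` and `w_v`, and all dimensions are illustrative assumptions.

```python
import numpy as np

def attention_pool(patch_tokens, query, w_k, w_v):
    """Pool N patch tokens of shape (N, D) into one token of shape (D,).

    Hypothetical single-query attention pooling; the paper's adaptive
    variant may differ in how the query and projections are formed.
    """
    keys = patch_tokens @ w_k                          # (N, D)
    values = patch_tokens @ w_v                        # (N, D)
    scores = keys @ query / np.sqrt(query.shape[0])    # (N,)
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()                  # softmax over patches
    return weights @ values, weights                   # (D,), (N,)

# Toy inputs: 256 patch tokens of width 64 (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
num_patches, dim = 256, 64
patches = rng.normal(size=(num_patches, dim))
query = rng.normal(size=dim)                 # stands in for a learned query
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)

frame_token, weights = attention_pool(patches, query, w_k, w_v)
```

In this toy setting the 256 patch tokens collapse to a single 64-dimensional vector per frame, so the downstream module sees one token per frame regardless of how many patches the encoder produced.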

📝 Abstract

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent-action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $π_0$).
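As a rough illustration of producing the latent stream and the action trajectory under one flow-matching objective, the sketch below concatenates per-frame latent tokens and an action chunk into a single target vector and applies a standard linear-path flow-matching loss. The time convention (noise at t = 0, data at t = 1), the function names, and all shapes are assumptions for illustration and may not match the paper's or the backbone's formulation.

```python
import numpy as np

def joint_flow_matching_loss(latents, actions, velocity_fn, rng):
    """Flow-matching loss over concatenated latent tokens and actions.

    latents: (T, D) array, one token per frame (assumed layout).
    actions: (H, A) array, an action chunk (assumed layout).
    velocity_fn: hypothetical model mapping (x_t, t) to a velocity estimate.
    """
    x1 = np.concatenate([latents.ravel(), actions.ravel()])  # data sample
    x0 = rng.normal(size=x1.shape)                           # noise sample
    t = rng.uniform()                                        # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1                            # linear path
    target = x1 - x0                                         # path velocity
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

def zero_velocity(x_t, t):
    """Placeholder predictor standing in for the trained network."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))    # 8 frames, one 16-dim token each
actions = rng.normal(size=(10, 7))    # 10 steps of a 7-dim action
loss = joint_flow_matching_loss(latents, actions, zero_velocity, rng)
```

Because both streams share one target vector and one loss, no separate decoder is needed to tie latent predictions to actions in this sketch; a real model would replace `zero_velocity` with a network conditioned on the backbone's features.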

Problem

Research questions and friction points this paper is trying to address.

visual bandwidth

world models

VLA policy

long-horizon planning

per-frame representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

One Token Per Frame

Adaptive Attention Pooling

Flow-Matching Objective