AI Summary
Existing vision-language-action (VLA) models often neglect the temporal causal structure of visual dynamics or waste modeling capacity reconstructing redundant background information, and they fail to integrate continuous dynamics with world knowledge. This work proposes the "Chain of World" paradigm, which uses a pretrained video VAE to disentangle structure and motion latents. The model learns to infer a continuous latent motion chain from a language instruction and an initial frame, then predicts the segment's terminal frame. By jointly modeling sparse keyframes and action sequences, the approach aligns this latent dynamic with discrete action prediction. The method unifies the temporal reasoning of world models with disentangled latent motion representations, yielding a VLA pretraining framework that preserves continuity, compactness, and interpretability. Experiments on robotic simulation benchmarks demonstrate superior performance over existing world-model and latent-action approaches while maintaining computational efficiency, validating its effectiveness for embodied intelligence.
Abstract
Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.
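The three-stage flow described above (VAE factorization, latent motion chain rollout, terminal-frame prediction) can be sketched with toy stand-ins. This is a minimal illustration only: the latent dimensions, chain length, function names, and dynamics below are assumptions for exposition, not the CoWVLA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes -- chosen for illustration, not from the paper.
D_STRUCT, D_MOTION, T = 8, 4, 5  # structure dim, motion dim, chain length

def vae_encode(frames):
    """Stand-in for the pretrained video VAE: factorize a clip into one
    shared structure latent and per-step motion latents."""
    structure = frames.mean(axis=0)[:D_STRUCT]
    motion = frames[:, :D_MOTION] - frames[:1, :D_MOTION]
    return structure, motion

def infer_motion_chain(instruction_emb, init_frame):
    """Stand-in for pretraining: roll out a continuous latent motion chain
    conditioned on the instruction and the initial frame."""
    state = np.tanh(instruction_emb[:D_MOTION] + init_frame[:D_MOTION])
    chain = []
    for _ in range(T):
        state = np.tanh(state + 0.1)  # toy continuous dynamics
        chain.append(state)
    return np.stack(chain)

def predict_terminal_frame(structure, motion_chain):
    """Combine the shared structure latent with the final motion step to
    form the segment's terminal-frame latent."""
    frame = np.zeros(D_STRUCT)
    frame[:D_MOTION] = motion_chain[-1]
    return structure + frame

frames = rng.normal(size=(T, D_STRUCT))       # dummy video segment
instruction_emb = rng.normal(size=D_STRUCT)   # dummy language embedding

structure, _ = vae_encode(frames)
chain = infer_motion_chain(instruction_emb, frames[0])
terminal = predict_terminal_frame(structure, chain)
print(chain.shape, terminal.shape)  # -> (5, 4) (8,)
```

The point of the factorization is capacity allocation: the structure latent is computed once per segment, so the chain only has to model the compact motion component rather than reconstructing redundant background at every step.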