AI Summary
This work addresses the challenge that existing vision-language-action (VLA) models struggle to learn action-relevant state transitions during pretraining due to confounding factors such as appearance bias, spurious motion cues, and information leakage from future frames. To mitigate these issues, the authors propose a JEPA-inspired two-stage pretraining framework: first, a target encoder generates latent representations of future frames, and a student network predicts these latent states solely from current observations, thereby modeling dynamics in latent space without access to future information; second, an action head is fine-tuned for efficient policy learning. This approach significantly enhances robustness to camera motion and background variations, streamlines the conventional multi-stage pipeline, and achieves superior generalization performance across LIBERO, LIBERO-Plus, SimplerEnv, and real-world manipulation tasks compared to existing methods.
Abstract
Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is "leakage-free state prediction": a target encoder produces latent representations from future frames, while the student pathway sees only the current observation; future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe (JEPA pretraining followed by action-head fine-tuning) without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv, and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
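The core objective described above (future frames feed only a target encoder, the student predicts that latent from the current observation alone) can be sketched in a few lines. This is a toy illustration with linear "encoders" and numpy, not the paper's implementation; the dimensions, the identity predictor, and the EMA target update are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; real models use deep encoders).
OBS_DIM, LATENT_DIM = 16, 8

W_student = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1
W_target = W_student.copy()   # target encoder initialized from the student
W_pred = np.eye(LATENT_DIM)   # predictor: current latent -> predicted future latent

def jepa_loss(obs_now, obs_future):
    """Leakage-free latent prediction: the future frame feeds ONLY the
    target encoder (as a supervision target); the student pathway sees
    only the current observation."""
    z_target = W_target @ obs_future   # treated as a constant (no gradient)
    z_now = W_student @ obs_now        # student latent of the current frame
    z_pred = W_pred @ z_now            # predicted future latent
    return float(np.mean((z_pred - z_target) ** 2))

def ema_update(w_target, w_student, tau=0.99):
    """Target encoder tracks the student via an exponential moving average,
    a common choice in JEPA-style training (assumed here)."""
    return tau * w_target + (1.0 - tau) * w_student

obs_t = rng.normal(size=OBS_DIM)
obs_t1 = obs_t + 0.01 * rng.normal(size=OBS_DIM)  # nearly identical future frame
loss = jepa_loss(obs_t, obs_t1)
W_target = ema_update(W_target, W_student)
```

Because prediction happens in latent space, nuisance pixel changes that the encoder learns to discard (camera jitter, background flicker) contribute nothing to the loss; only state changes that survive encoding must be predicted.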