AI Summary
This work addresses the challenge that existing vision-language-action (VLA) models struggle to learn action-relevant state transitions during pretraining due to confounding factors such as appearance bias, spurious motion cues, and information leakage from future frames. To mitigate these issues, the authors propose a JEPA-inspired two-stage pretraining framework: first, a target encoder generates latent representations of future frames, and a student network predicts these latent states solely from current observations, thereby modeling dynamics in latent space without access to future information; second, an action head is fine-tuned for efficient policy learning. This approach significantly enhances robustness to camera motion and background variations, streamlines the conventional multi-stage pipeline, and achieves superior generalization performance across LIBERO, LIBERO-Plus, SimplerEnv, and real-world manipulation tasks compared to existing methods.
Abstract
Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is "leakage-free state prediction": a target encoder produces latent representations from future frames, while the student pathway sees only the current observation; future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe (JEPA pretraining followed by action-head fine-tuning) without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv, and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
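The core objective described above (future frames feed only a target encoder, the student predicts that latent from the current observation alone) can be sketched in a few lines. This is a toy illustration with linear "encoders" and numpy, not the paper's implementation; the dimensions, the identity predictor, and the EMA target update are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; real models use deep encoders).
OBS_DIM, LATENT_DIM = 16, 8

W_student = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1
W_target = W_student.copy()   # target encoder initialized from the student
W_pred = np.eye(LATENT_DIM)   # predictor: current latent -> predicted future latent

def jepa_loss(obs_now, obs_future):
    """Leakage-free latent prediction: the future frame feeds ONLY the
    target encoder (as a supervision target); the student pathway sees
    only the current observation."""
    z_target = W_target @ obs_future   # treated as a constant (no gradient)
    z_now = W_student @ obs_now        # student latent of the current frame
    z_pred = W_pred @ z_now            # predicted future latent
    return float(np.mean((z_pred - z_target) ** 2))

def ema_update(w_target, w_student, tau=0.99):
    """Target encoder tracks the student via an exponential moving average,
    a common choice in JEPA-style training (assumed here)."""
    return tau * w_target + (1.0 - tau) * w_student

obs_t = rng.normal(size=OBS_DIM)
obs_t1 = obs_t + 0.01 * rng.normal(size=OBS_DIM)  # nearly identical future frame
loss = jepa_loss(obs_t, obs_t1)
W_target = ema_update(W_target, W_student)
```

Because prediction happens in latent space, nuisance pixel changes that the encoder learns to discard (camera jitter, background flicker) contribute nothing to the loss; only state changes that survive encoding must be predicted.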