🤖 AI Summary
This work addresses the limited long-horizon future awareness of robotic policy learning. We propose FLARE, a framework that integrates implicit latent world modeling into vision-language-action (VLA) models, using a diffusion transformer to align its internal features with latent embeddings of future observations, so the policy can anticipate the latent states of future observations and account for long-term consequences while generating actions. A key innovation is a lightweight future latent alignment mechanism: it introduces only a small number of additional tokens, enabling action-free co-training on first-person human video. FLARE achieves state-of-the-art performance on both single-arm and humanoid robot tabletop manipulation benchmarks, outperforming prior methods by up to 26%. Moreover, it generalizes to objects with unseen geometry from as little as one real-world demonstration.
📝 Abstract
We introduce $\textbf{F}$uture $\textbf{LA}$tent $\textbf{RE}$presentation Alignment ($\textbf{FLARE}$), a novel framework that integrates predictive latent world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, $\textbf{FLARE}$ enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, $\textbf{FLARE}$ requires only minimal architectural modifications -- adding a few tokens to standard vision-language-action (VLA) models -- yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, $\textbf{FLARE}$ achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, $\textbf{FLARE}$ unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to novel objects with unseen geometry from as few as a single robot demonstration. Our results establish $\textbf{FLARE}$ as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
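The core idea -- aligning features read off a few extra tokens of the policy's diffusion transformer with latent embeddings of the actual future observation -- can be sketched as a simple auxiliary loss. The snippet below is a minimal NumPy illustration, not the paper's implementation: the shapes, the frozen-encoder target, and the cosine-similarity objective are assumptions chosen to show the mechanism, and the paper's exact loss may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: B batch, K extra "future tokens", D latent dim.
B, K, D = 4, 8, 32

# Features read off the K additional tokens appended to the policy's
# diffusion transformer (stand-ins for real network outputs).
future_tokens = rng.normal(size=(B, K, D))

# Alignment target: latent embedding of the future observation from a
# frozen vision encoder. Because this target needs no action labels,
# the same loss applies to action-free egocentric human video.
future_latents = rng.normal(size=(B, K, D))

def alignment_loss(pred, target):
    """Negative cosine similarity between predicted and target latents,
    averaged over batch and tokens (one common alignment objective;
    the paper's exact choice is not specified here)."""
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    tgt_n = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return 1.0 - (pred_n * tgt_n).sum(axis=-1).mean()

# In training this term would be added to the action diffusion loss.
loss = alignment_loss(future_tokens, future_latents)
print(float(loss))
```

The loss is zero when predicted and target latents point in the same direction and at most two when they are opposed, so it regularizes the policy's representation toward the future without requiring pixel-level future prediction.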