🤖 AI Summary
This work addresses the limited long-horizon future awareness of robotic policy learning. We propose FLARE, a framework that integrates implicit latent world modeling into vision-language-action (VLA) models, using a diffusion transformer to align its internal features with latent embeddings of future observations, so the policy can anticipate the latent states of future observations and account for long-term consequences while generating actions. A key innovation is a lightweight future latent alignment mechanism: it introduces only a small number of additional tokens, enabling action-free co-training on first-person human video. FLARE achieves state-of-the-art performance on both single-arm and humanoid robot tabletop manipulation benchmarks, outperforming prior methods by up to 26%. Moreover, it generalizes to objects with unseen geometry from as little as one real-world demonstration.
📝 Abstract
We introduce $\textbf{F}$uture $\textbf{LA}$tent $\textbf{RE}$presentation Alignment ($\textbf{FLARE}$), a novel framework that integrates predictive latent world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, $\textbf{FLARE}$ enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, $\textbf{FLARE}$ requires only minimal architectural modifications -- adding a few tokens to standard vision-language-action (VLA) models -- yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, $\textbf{FLARE}$ achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, $\textbf{FLARE}$ unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to novel objects with unseen geometry from as few as a single robot demonstration. Our results establish $\textbf{FLARE}$ as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
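The core idea -- aligning features read off a few extra tokens of the policy's diffusion transformer with latent embeddings of the actual future observation -- can be sketched as a simple auxiliary loss. The snippet below is a minimal NumPy illustration, not the paper's implementation: the shapes, the frozen-encoder target, and the cosine-similarity objective are assumptions chosen to show the mechanism, and the paper's exact loss may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: B batch, K extra "future tokens", D latent dim.
B, K, D = 4, 8, 32

# Features read off the K additional tokens appended to the policy's
# diffusion transformer (stand-ins for real network outputs).
future_tokens = rng.normal(size=(B, K, D))

# Alignment target: latent embedding of the future observation from a
# frozen vision encoder. Because this target needs no action labels,
# the same loss applies to action-free egocentric human video.
future_latents = rng.normal(size=(B, K, D))

def alignment_loss(pred, target):
    """Negative cosine similarity between predicted and target latents,
    averaged over batch and tokens (one common alignment objective;
    the paper's exact choice is not specified here)."""
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    tgt_n = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return 1.0 - (pred_n * tgt_n).sum(axis=-1).mean()

# In training this term would be added to the action diffusion loss.
loss = alignment_loss(future_tokens, future_latents)
print(float(loss))
```

The loss is zero when predicted and target latents point in the same direction and at most two when they are opposed, so it regularizes the policy's representation toward the future without requiring pixel-level future prediction.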