🤖 AI Summary
This work addresses two limitations of existing world modeling approaches: they rely heavily on pixel-level reconstruction, which hinders semantic understanding, and they predict observations directly, which causes error accumulation during inference. To overcome these issues, the authors propose FRAPPE, a framework employing a two-stage fine-tuning strategy: first predicting latent representations of future observations to bypass the pixel reconstruction bottleneck, then aligning those representations in parallel with multiple vision foundation models. This approach reduces dependence on action-labeled data while enhancing the policy's general-purpose perceptual capabilities. Evaluated on the RoboTwin benchmark and real-world robotic tasks, FRAPPE significantly outperforms current methods and demonstrates strong generalization in long-horizon execution and unseen environments.
📝 Abstract
Enabling vision-language-action (VLA) models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computation in parallel and align the predicted representations simultaneously with multiple visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhance world-awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and shows strong generalization in long-horizon and unseen scenarios.
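The parallel alignment idea in the post-training phase can be sketched as a multi-teacher objective: the policy's predicted future representation is pulled toward the features produced by several frozen vision foundation models at once. The sketch below is a minimal NumPy illustration assuming a simple cosine-similarity loss averaged over teachers; the abstract does not specify FRAPPE's actual loss formulation, and `alignment_loss` and the toy vectors are purely illustrative.

```python
import numpy as np

def alignment_loss(pred, teacher_feats):
    # Average (1 - cosine similarity) between the policy's predicted
    # future representation and each frozen teacher's features.
    # Hypothetical objective -- FRAPPE's exact loss is not given in the abstract.
    losses = []
    for t in teacher_feats:
        cos = np.dot(pred, t) / (np.linalg.norm(pred) * np.linalg.norm(t))
        losses.append(1.0 - cos)
    return float(np.mean(losses))

# Toy example: one predicted vector, two "teacher" embeddings.
pred = np.array([1.0, 0.0, 0.0])
teachers = [np.array([1.0, 0.0, 0.0]),   # perfectly aligned -> loss 0
            np.array([0.0, 1.0, 0.0])]   # orthogonal       -> loss 1
loss = alignment_loss(pred, teachers)    # mean of 0 and 1 -> 0.5
```

Averaging over several teachers in parallel is what lets a single predicted representation inherit complementary perceptual signals (e.g. semantic vs. geometric features) without adding extra inference-time passes.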