AI Summary
This work addresses the limited generalization of existing world model-based reinforcement learning approaches for vision-language-action (VLA) agents, which typically require task-specific fine-tuning of both the world and reward models. The authors propose RAW-Dream, a framework that fully decouples world model learning from downstream tasks by leveraging a task-agnostic pretrained world model and an off-the-shelf vision-language model (VLM) to generate rewards, enabling zero-shot policy fine-tuning within imagined environments. To ensure the reliability of synthetic trajectories, the method introduces a dual-noise verification mechanism that filters out low-quality imagined data. Evaluated in both simulated and real-world settings, RAW-Dream consistently outperforms baselines, demonstrating that generic physical priors can effectively replace costly task-specific data and substantially enhance cross-task zero-shot transfer capabilities.
Abstract
Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily fine-tuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.
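The abstract does not say how the dual-noise verification works, so the following is only an illustrative sketch under one assumed reading: imagine the same action sequence twice with different noise seeds, and keep the rollout only if the two imagined trajectories agree within a threshold. Every name here (`toy_world_model`, `imagine`, `dual_noise_filter`, `toy_vlm_reward`), the low-dimensional state representation, and the threshold rule are hypothetical stand-ins, not the authors' implementation; a real system would use a pretrained video world model and query a VLM for the reward.

```python
import random
from typing import List, Sequence, Tuple

State = List[float]
Action = List[float]


def toy_world_model(state: State, action: Action,
                    rng: random.Random, noise_scale: float) -> State:
    """Hypothetical stand-in for a pretrained world model:
    next state = state + action + sampling noise."""
    return [s + a + rng.gauss(0.0, noise_scale) for s, a in zip(state, action)]


def imagine(start: State, actions: Sequence[Action],
            seed: int, noise_scale: float) -> List[State]:
    """Roll the toy world model forward under one noise seed."""
    rng = random.Random(seed)
    state = list(start)
    trajectory = [state]
    for action in actions:
        state = toy_world_model(state, action, rng, noise_scale)
        trajectory.append(state)
    return trajectory


def max_divergence(traj_a: List[State], traj_b: List[State]) -> float:
    """Largest per-dimension gap between two imagined trajectories."""
    return max(abs(x - y)
               for sa, sb in zip(traj_a, traj_b)
               for x, y in zip(sa, sb))


def dual_noise_filter(start: State, actions: Sequence[Action],
                      noise_scale: float, threshold: float,
                      seeds: Tuple[int, int] = (0, 1)) -> Tuple[bool, List[State]]:
    """Assumed verification rule: accept a rollout only if two
    imaginations with different noise seeds stay within `threshold`."""
    traj_a = imagine(start, actions, seeds[0], noise_scale)
    traj_b = imagine(start, actions, seeds[1], noise_scale)
    return max_divergence(traj_a, traj_b) <= threshold, traj_a


def toy_vlm_reward(trajectory: List[State], goal: State) -> float:
    """Hypothetical stand-in for VLM-based reward: 1.0 if the final
    imagined state is close to the goal, else 0.0."""
    final = trajectory[-1]
    return 1.0 if max(abs(x - g) for x, g in zip(final, goal)) < 0.5 else 0.0


if __name__ == "__main__":
    actions = [[0.5, 0.5], [0.5, 0.5]]
    keep, traj = dual_noise_filter([0.0, 0.0], actions,
                                   noise_scale=0.0, threshold=0.01)
    if keep:
        # Only verified rollouts would contribute reward to policy fine-tuning.
        print("reward:", toy_vlm_reward(traj, goal=[1.0, 1.0]))
```

In this toy setup, a rollout whose two imaginations diverge is discarded before any reward is computed, which is the role the abstract assigns to the filter: keeping unreliable imagined data out of policy training.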