Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a structural mismatch between search and value learning in existing world-model-based reinforcement learning: policy improvement relies on value functions induced by a separate, non-search policy, which causes training inconsistency and suboptimal policies. The authors propose MBDPO (Model-Based Diffusion Policy Optimization), a framework that uses diffusion policy representations in latent world models. Rather than building an explicit planner, MBDPO treats policy optimization as a diffusion process over searched trajectories and extracts an implicit energy function from the collected dataset to anchor the policy, which unifies search and policy learning and mitigates the mismatch. The method is evaluated in multi-task offline pretraining, online learning, and offline-to-online fine-tuning; in the offline regime, performance improves consistently and monotonically as model capacity increases.

📝 Abstract

Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bias and error compounding, which degrade long-horizon predictions. Beyond these issues, we identify a more critical yet underexplored bottleneck: a structural misalignment between search and value learning in existing world model approaches. In particular, policy improvement often relies on value functions induced by a separate, non-search policy, resulting in training inconsistency and ultimately suboptimal learning. To address this limitation, we propose Model-Based Diffusion Policy Optimization (MBDPO) in world models, a framework that unifies search and policy optimization through diffusion policy representations, thereby unlocking the potential of world models for scalable policy learning. Instead of constructing an explicit planner over a learned world model, we reformulate policy optimization as a diffusion process over searched trajectories in latent world models. In this view, we extract an implicit energy function from the collected dataset that anchors the policy, enabling MBDPO to refine the score field for policy optimization while mitigating misalignment. We evaluate MBDPO across a wide range of settings, including multi-task offline pretraining, online learning, and offline-to-online fine-tuning. In the offline regime, we further investigate its scaling behavior by pretraining on large-scale datasets, observing consistent and monotonic performance gains with increasing model capacity.
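The abstract does not give the form of the energy function or the sampler, so the following is only a minimal sketch of the general idea of refining a trajectory along an energy-derived score field. It is not the authors' MBDPO algorithm. It swaps in plain annealed Langevin dynamics on a toy problem, and every name and modeling choice here is an assumption made for illustration: the linear latent dynamics in `rollout`, the quadratic goal-reaching return, the Gaussian anchor toward `prior_actions` (standing in for a dataset-derived anchor), and the parameters `temp`, `prior_std`, `step`, and `n_steps`.

```python
import numpy as np


def rollout(z0, actions):
    """Toy latent dynamics (assumed): z_{t+1} = z_t + a_t. Returns z_1..z_H."""
    return z0 + np.cumsum(actions, axis=0)


def energy(actions, z0, goal, prior_actions, temp=1.0, prior_std=1.0):
    """Hypothetical trajectory energy: low for high return and for actions near the anchor."""
    z = rollout(z0, actions)
    ret = -np.sum((z - goal) ** 2)  # toy return: stay close to the goal
    anchor = 0.5 * np.sum((actions - prior_actions) ** 2) / prior_std**2
    return -ret / temp + anchor


def score(actions, z0, goal, prior_actions, temp=1.0, prior_std=1.0):
    """Score field, i.e. the negative gradient of `energy` w.r.t. the actions."""
    z = rollout(z0, actions)
    # Action a_k shifts every later latent state, so its gradient is a
    # reverse cumulative sum over the per-step return gradients.
    grad_ret = -2.0 * np.cumsum((z - goal)[::-1], axis=0)[::-1]
    return grad_ret / temp - (actions - prior_actions) / prior_std**2


def refine(init_actions, z0, goal, prior_actions, n_steps=300, step=0.02, seed=0):
    """Annealed Langevin refinement: follow the score while the noise decays to zero."""
    rng = np.random.default_rng(seed)
    a = np.array(init_actions, dtype=float)
    for noise_scale in np.linspace(1.0, 0.0, n_steps):
        noise = rng.standard_normal(a.shape)
        a = a + step * score(a, z0, goal, prior_actions) \
              + np.sqrt(2.0 * step) * noise_scale * noise
    return a


if __name__ == "__main__":
    z0, goal = np.zeros(2), np.ones(2)
    prior = np.zeros((5, 2))  # placeholder for dataset-derived actions
    init = np.random.default_rng(0).standard_normal((5, 2))
    final = refine(init, z0, goal, prior)
    print("energy before:", energy(init, z0, goal, prior))
    print("energy after: ", energy(final, z0, goal, prior))
```

In this toy setting the anchor term keeps the refined action sequence close to the reference actions, while the return term pulls it toward higher-value trajectories. A learned world model and a learned score network would replace `rollout` and `score` in any realistic setup.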

Problem

Research questions and friction points this paper is trying to address.

world models

model-based reinforcement learning

search-policy misalignment

value learning

scaling

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion policy

world model

model-based reinforcement learning