Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

📅 2024-10-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Offline reinforcement learning faces two key bottlenecks: reliance on expert demonstrations and poor cross-task generalization. This paper proposes Jointly-Optimized World-Action modeling (JOWA), a framework that unifies image-based world modeling and action policy learning within a single architecture. JOWA combines error-compensating parallel model predictive control (MPC) planning with token-level large-scale pretraining (6 billion tokens), enabling stable temporal-difference pretraining at a 150M-parameter scale without any expert data. On the Atari offline benchmark, JOWA achieves 78.9% of human-level performance using only 10% of the dataset, outperforming prior state-of-the-art methods by 31.6% on average. Crucially, it transfers sample-efficiently: adapting to unseen games requires only 5k offline fine-tuning transitions per game (about 4 trajectories), substantially improving cross-game generalization.

📝 Abstract
A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image observation-based world models for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games with 6 billion tokens of data to learn general-purpose representations and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data points (approximately 4 trajectories) per game, demonstrating superior generalization. We will release code and model weights at https://github.com/CJReinforce/JOWA
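The abstract's central idea, one shared backbone optimized jointly for world modeling and temporal-difference learning, can be sketched in a few lines. This is a toy illustration, not the paper's architecture: all names and shapes here are invented, the backbone is a single tanh layer rather than a transformer, and the TD target is assumed to be precomputed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the shared backbone and its two heads (illustrative only):
# one feature extractor feeds both the world-model head (next-token logits)
# and the Q head (per-action values), so both losses shape the representation.
D, VOCAB, N_ACTIONS = 16, 32, 4
W_backbone = rng.standard_normal((D, D)) * 0.1
W_world = rng.standard_normal((D, VOCAB)) * 0.1
W_q = rng.standard_normal((D, N_ACTIONS)) * 0.1

def features(x):
    # Shared representation used by both heads.
    return np.tanh(x @ W_backbone)

def joint_loss(x, next_token, action, td_target, lam=1.0):
    h = features(x)
    # World-model loss: cross-entropy on the next observation token.
    logits = h @ W_world
    m = logits.max()
    logp = logits - np.log(np.exp(logits - m).sum()) - m
    wm_loss = -logp[next_token]
    # TD loss: squared error between Q(s, a) and a precomputed TD target.
    q = h @ W_q
    td_loss = (q[action] - td_target) ** 2
    # Gradients of both terms would flow into the shared backbone weights.
    return wm_loss + lam * td_loss
```

The point of the shared backbone, as the abstract notes, is that the dense world-model objective regularizes the representation and stabilizes TD learning at scale.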
Problem

Research questions and friction points this paper is trying to address.

Developing a generalist agent from large, heterogeneous offline datasets.
Generalizing to diverse unseen tasks without expert trajectories.
Compensating for Q-value estimation error with efficient planning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly-optimized world-action model for offline RL
Shared transformer backbone stabilizes temporal difference learning
Provably efficient, parallelizable planning algorithm compensates for Q-value estimation error
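The third contribution, planning with the world model to compensate for imperfect Q-values, can be sketched as a small beam search over imagined rollouts. This is a minimal sketch under invented assumptions, not the paper's algorithm: `world_step` and `q_values` are toy stand-ins for the pretrained model, and the error compensation here is a simple conservative proxy (averaging the top-2 leaf Q-values instead of taking the max), whereas JOWA derives a provably efficient scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 4

def world_step(state, action):
    # Toy latent dynamics and reward; a stand-in for the learned world model.
    nxt = np.tanh(state + 0.1 * action)
    return nxt, float(nxt.sum())

def q_values(state):
    # Toy per-action Q estimates; a stand-in for the learned Q head.
    return state[:N_ACTIONS] * 1.0

def plan(state, horizon=2, beam=3, gamma=0.99):
    """Beam-search planning sketch: expand all actions of each candidate,
    score partial plans by imagined return plus a discounted leaf value,
    and keep the top `beam`. The max over leaf Q-values is replaced by the
    mean of the top-2 as a crude hedge against Q overestimation."""
    candidates = [(0.0, state, None)]  # (return so far, state, first action)
    for t in range(horizon):
        expanded = []
        for ret, s, first in candidates:
            for a in range(N_ACTIONS):
                s2, r = world_step(s, a)
                expanded.append((ret + gamma**t * r,
                                 s2,
                                 a if first is None else first))
        def score(c):
            top2 = np.sort(q_values(c[1]))[-2:]
            return c[0] + gamma**(t + 1) * top2.mean()
        candidates = sorted(expanded, key=score, reverse=True)[:beam]
    return candidates[0][2]  # first action of the best imagined plan
```

Because each candidate's expansions are independent, the inner loop batches naturally across candidates and actions, which is what makes this style of planning parallelizable.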
🔎 Similar Papers

Jie Cheng
Institute of Automation, Chinese Academy of Sciences
Reinforcement Learning
Ruixi Qiao
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; Artificial Intelligence, University of Chinese Academy of Sciences
Gang Xiong
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; Artificial Intelligence, University of Chinese Academy of Sciences
Qinghai Miao
Artificial Intelligence, University of Chinese Academy of Sciences
Yingwei Ma
Moonshot AI
LLM, Coding Agent
Binhua Li
Alibaba Group
Yongbin Li
Alibaba Group
Yisheng Lv
The University of Chinese Academy of Sciences, and Chinese Academy of Sciences
Parallel Intelligence, AI for Transportation, Autonomous Vehicles, Parallel Transportation Systems