🤖 AI Summary
In multitask embodied reinforcement learning, world model–based methods typically extract policies with inefficient gradient-free optimization, while gradient-based methods struggle with discontinuities in the dynamics. This paper shows that a well-regularized world model can induce a smoother optimization landscape than the true dynamics, making first-order optimization effective, and proposes PWM (Policy learning with multi-task World Models): a world model is first pretrained on offline data, then continuous control policies are extracted from it in latent space purely via first-order gradient optimization. PWM scales to high-dimensional action spaces (up to 152 dimensions) and an 80-task setting, extracts each policy in under 10 minutes, outperforms baselines with access to ground-truth dynamics, and achieves up to 27% higher rewards than existing baselines without costly online planning.
📝 Abstract
Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World model methods offer scalability by learning a simulation of the environment but often rely on inefficient gradient-free optimization methods for policy extraction. In contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals that well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task. PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without relying on costly online planning. Visualizations and code are available at https://www.imgeorgiev.com/pwm/.
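To make the core idea concrete, here is a minimal toy sketch (not the paper's implementation) of first-order policy extraction through a smooth learned latent dynamics model: a single action is optimized by analytic gradient ascent through an assumed one-step model `z' = tanh(A z + B a)` toward a goal latent state. All matrices, dimensions, and the reward are illustrative assumptions.

```python
import numpy as np

# Toy sketch: first-order gradient ascent through a smooth *learned*
# latent dynamics model z' = tanh(A z + B a), maximizing the surrogate
# reward r(a) = -||z' - z_goal||^2. A, B, z, z_goal are illustrative.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) * 0.1   # latent transition (assumed learned)
B = rng.normal(size=(4, 2)) * 0.5   # action effect (assumed learned)
z = rng.normal(size=4)              # current latent state
z_goal = np.zeros(4)                # target latent state

def step(a):
    # smooth (tanh) learned dynamics -> differentiable everywhere,
    # unlike contact-rich true dynamics
    return np.tanh(A @ z + B @ a)

def reward(a):
    return -np.sum((step(a) - z_goal) ** 2)

a = np.zeros(2)                     # action to optimize
lr = 0.1
for _ in range(200):
    z_next = np.tanh(A @ z + B @ a)
    # analytic first-order gradient of the reward w.r.t. the action,
    # backpropagated through the smooth model:
    # dr/da = B^T diag(1 - z'^2) * (-2)(z' - z_goal)
    grad = -2.0 * ((z_next - z_goal) * (1.0 - z_next ** 2)) @ B
    a += lr * grad                  # pure gradient ascent step
```

Because the surrogate landscape is smooth, plain gradient steps improve the surrogate reward without any sampling-based (gradient-free) search; the paper applies this principle to full policies over multi-step latent rollouts rather than a single open-loop action.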