Generalist World Model Pre-Training for Efficient Reinforcement Learning

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robot reinforcement learning (RL) suffers from low sample efficiency because it typically relies on expert-labeled demonstrations and task-specific reward signals. Method: The paper proposes a general learning framework for reward-free, non-expert, multi-embodiment offline data. It integrates generalist world model pre-training (WPT) with retrieval-based experience rehearsal and execution guidance, enabling efficient policy learning and fast task adaptation without any reward signals or expert demonstrations. Results: Evaluated across 72 visuomotor tasks spanning six robot embodiments, the method achieves 35.65% and 35% higher aggregated scores than two widely used learning-from-scratch baselines, and it improves generalization under challenging conditions, including hard exploration, complex dynamics, and high visual diversity. The work demonstrates that generalist world models can effectively leverage uncurated, cross-embodiment offline data, establishing a path toward reduced data dependency in robotic RL.

📝 Abstract
Sample-efficient robot learning is a longstanding goal in robotics. Inspired by the success of scaling in vision and language, the robotics community is now investigating large-scale offline datasets for robot learning. However, existing methods often require expert and/or reward-labeled task-specific data, which can be costly and limit their application in practice. In this paper, we consider a more realistic setting where the offline data consists of reward-free and non-expert multi-embodiment offline data. We show that generalist world model pre-training (WPT), together with retrieval-based experience rehearsal and execution guidance, enables efficient reinforcement learning (RL) and fast task adaptation with such non-curated data. In experiments over 72 visuomotor tasks, spanning 6 different embodiments, covering hard exploration, complex dynamics, and various visual properties, WPT achieves 35.65% and 35% higher aggregated score compared to widely used learning-from-scratch baselines, respectively.
Problem

Research questions and friction points this paper is trying to address.

Enhances reinforcement learning efficiency
Utilizes non-expert, reward-free data
Improves task adaptation across embodiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalist world model pre-training
Retrieval-based experience rehearsal
Execution guidance for RL
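To make the three components above concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): a linear dynamics model stands in for the pre-trained world model, and a nearest-neighbor lookup over pooled reward-free transitions stands in for retrieval-based experience rehearsal. All names, shapes, and the linear model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reward-free offline data: (state, action, next_state)
# transitions pooled across embodiments; shapes are illustrative only.
S, A = 8, 2
states = rng.normal(size=(1000, S))
actions = rng.normal(size=(1000, A))
next_states = states + 0.1 * actions @ rng.normal(size=(A, S))

# 1) World model pre-training (stand-in): fit dynamics s' ~ [s, a] W
#    by least squares on the reward-free transitions.
X = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(X, next_states, rcond=None)

# 2) Retrieval-based experience rehearsal (stand-in): given a state seen
#    during fine-tuning, fetch the k most similar stored transitions to replay.
def retrieve(query_state, k=5):
    dists = np.linalg.norm(states - query_state, axis=1)
    idx = np.argsort(dists)[:k]
    return states[idx], actions[idx], next_states[idx]

query = rng.normal(size=S)
s_k, a_k, ns_k = retrieve(query)

# Roll the pre-trained model forward on the retrieved batch; in the paper,
# such predictions would also guide execution during policy optimization.
pred = np.hstack([s_k, a_k]) @ W
print(pred.shape)  # (5, 8)
```

The sketch only shows the data flow: pre-train on uncurated transitions, then reuse both the model and retrieved experience when adapting to a new task.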
👥 Authors
Yi Zhao
Aalto University, Finland
Aidan Scannell
University of Edinburgh
machine learning, sequential decision-making, robotics
Yuxin Hou
Niantic Inc.
computer vision, machine learning
Tianyu Cui
Aalto University, Finland
Le Chen
Max Planck Institute for Intelligent Systems, Germany
Dieter Buchler
Max Planck Institute for Intelligent Systems, Germany; University of Alberta, Canada; Alberta Machine Intelligence Institute (Amii), Canada
Arno Solin
Associate Professor in Machine Learning, Aalto University
machine learning, Gaussian processes, sensor fusion, generative modelling
Juho Kannala
Associate Professor, Aalto University & University of Oulu, Finland
computer vision, machine learning
J. Pajarinen
Aalto University, Finland