World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the scarcity of expert demonstration data and the Sim2Real transfer challenge in robotic manipulation, this paper proposes a policy refinement framework built upon a frozen diffusion world model. Methodologically, it introduces the first use of a pre-trained diffusion model as a high-fidelity “imagined environment,” enabling end-to-end policy optimization without real-world robot interaction. A robot-specific two-hot action encoding scheme is designed to improve action-space modeling accuracy. By combining multi-task pre-training with model freezing, the approach balances dynamic modeling fidelity and optimization efficiency. Experimental results demonstrate significant improvements in task success rates on both simulated and real robotic arm platforms. The method effectively reduces reliance on physical interaction, mitigates the Sim2Real gap, and supports continual policy refinement—validating its efficacy for data-efficient, transferable robotic learning.
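The refinement loop described above — rolling the policy out inside a frozen world model and optimizing it on imagined returns, with no real-robot interaction — can be sketched in miniature. Everything below is an illustrative stand-in: `FrozenWorldModel` is a toy linear dynamics, not the paper's diffusion model, and the finite-difference policy gradient substitutes for a proper RL algorithm.

```python
import numpy as np

class FrozenWorldModel:
    """Stand-in for the pre-trained, frozen world model: given a state
    and action, it predicts the next state and a reward. Here it is a
    fixed toy linear dynamics, purely for illustration."""
    def step(self, state, action):
        next_state = 0.9 * state + action      # imagined transition
        reward = -float(next_state ** 2)       # imagined reward: drive state to 0
        return next_state, reward

def refine_policy(world_model, theta=0.0, lr=0.05, horizon=10, iters=200):
    """Refine a 1-D linear policy a = theta * s entirely inside the frozen
    world model: all rollouts are imagined, so refinement needs no
    real-environment interaction."""
    def rollout_return(th):
        s, ret = 1.0, 0.0
        for _ in range(horizon):
            s, r = world_model.step(s, th * s)
            ret += r
        return ret
    for _ in range(iters):
        eps = 1e-3  # finite-difference gradient of the imagined return
        grad = (rollout_return(theta + eps) - rollout_return(theta - eps)) / (2 * eps)
        theta += lr * grad  # gradient ascent on imagined return
    return theta

theta = refine_policy(FrozenWorldModel())
# The optimal linear policy here cancels the dynamics: theta -> -0.9.
```

In the toy dynamics `s' = (0.9 + theta) * s`, the imagined return is maximized at `theta = -0.9`, which the loop recovers; the paper's framework plays the same role with a diffusion world model and a learned policy network in place of these stand-ins.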

📝 Abstract
Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine policies to alleviate this limitation, yet real-robot training is costly and unsafe, while training in simulators suffers from the sim-to-real gap. Recent advances in generative models have demonstrated remarkable capabilities in real-world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model-based world models can be leveraged to enhance pre-trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end-to-end policy optimization. World4RL is designed around two principles: pre-training a diffusion world model that captures diverse dynamics on multi-task datasets and refining policies entirely within a frozen world model to avoid online real-world interactions. We further design a two-hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real-world experiments demonstrate that World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning and other baselines. More visualization results are available at https://world4rl.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Refining robotic manipulation policies without costly real-world training
Overcoming sim-to-real gap limitations in policy refinement methods
Enhancing pre-trained policies using diffusion-based world models simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion models as high-fidelity simulators
Enables end-to-end policy optimization in imagined environments
Employs two-hot action encoding for robotic manipulation
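The two-hot action encoding listed above is, in its general form, a standard way to represent a continuous value as weights on the two nearest bins of a discretization grid, making the encoding lossless within the grid's range. The sketch below shows that general idea per action dimension; the bin grid is an assumed example, not the paper's actual action discretization.

```python
import numpy as np

def two_hot_encode(value, bins):
    """Encode a continuous scalar as a 'two-hot' vector: all probability
    mass is split between the two bin centers bracketing the value,
    weighted by proximity, so decoding recovers the value exactly."""
    value = np.clip(value, bins[0], bins[-1])
    idx = np.searchsorted(bins, value, side="right") - 1
    idx = min(idx, len(bins) - 2)          # keep the upper neighbor in range
    lo, hi = bins[idx], bins[idx + 1]
    w_hi = (value - lo) / (hi - lo)        # proximity weight on upper bin
    enc = np.zeros(len(bins))
    enc[idx] = 1.0 - w_hi
    enc[idx + 1] = w_hi
    return enc

def two_hot_decode(enc, bins):
    """Invert the encoding: expectation of bin centers under the weights."""
    return float(np.dot(enc, bins))

bins = np.linspace(-1.0, 1.0, 21)          # assumed per-dimension bin grid
enc = two_hot_encode(0.13, bins)           # mass split between bins 0.1 and 0.2
decoded = two_hot_decode(enc, bins)        # recovers 0.13 up to float error
```

Compared with plain one-hot discretization, the two-hot scheme avoids quantization error inside the grid range while keeping a discrete, classification-style target, which is one reason it is attractive for modeling robot action spaces.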
Zhennan Jiang
Institute of Automation, Chinese Academy of Sciences
Reinforcement learning · Robotics
Kai Liu
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Yuxin Qin
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Shuai Tian
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Yupeng Zheng
Institute of Automation, Chinese Academy of Sciences
Mingcai Zhou
Beijing Zhongke Huiling Robot Technology Co., Beijing, China
Chao Yu
Department of Electronic Engineering, Tsinghua University, Beijing, China; Zhongguancun Academy, Beijing, China
Haoran Li
The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Dongbin Zhao
Institute of Automation, Chinese Academy of Sciences
Deep Reinforcement Learning · Adaptive Dynamic Programming · Game AI · Smart driving · Robotics