GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work tackles the core obstacle to integrating diffusion policies into on-policy reinforcement learning frameworks: efficiently computing state-action log-likelihoods, which is intractable for diffusion models inside large-scale parallel GPU simulators such as IsaacLab. To resolve this, the authors propose (1) an exact diffusion inversion process coupled with a doubled dummy action mechanism that makes the action mapping invertible, and (2) the first unbiased estimator of action entropy and KL divergence based on the action log-likelihood, enabling KL-adaptive learning rates and entropy regularization. The resulting method achieves the first end-to-end on-policy training of diffusion policies within the PPO framework. Empirically, it outperforms mainstream RL baselines across eight challenging IsaacLab robotic tasks, including Ant and Humanoid, and has been successfully deployed on real robotic hardware.

📝 Abstract
Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving the log-likelihood computation barrier. Furthermore, we use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
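The KL-adaptive learning rate mentioned in the abstract can be illustrated with the standard rule used in on-policy RL trainers, where the step size shrinks when the measured policy KL overshoots a target and grows when the update is too conservative. This is a hedged sketch, not GenPO's exact schedule: the thresholds, scaling factor, and bounds below are illustrative assumptions; in GenPO the `kl` value would come from its unbiased log-likelihood-based estimator.

```python
# Sketch of a KL-adaptive learning-rate rule (assumed hyperparameters,
# not GenPO's published values).
def kl_adaptive_lr(lr, kl, kl_target=0.01, factor=1.5,
                   lr_min=1e-5, lr_max=1e-2):
    """Adjust the learning rate from the measured policy KL divergence."""
    if kl > kl_target * 2.0:
        # The update moved the policy too far: shrink the step size.
        lr = max(lr / factor, lr_min)
    elif kl < kl_target / 2.0:
        # The update barely moved the policy: grow the step size.
        lr = min(lr * factor, lr_max)
    return lr
```

Clamping to `[lr_min, lr_max]` keeps the schedule stable when the KL estimate is noisy early in training.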
Problem

Research questions and friction points this paper is trying to address.

Integrating diffusion policies into on-policy RL frameworks like PPO
Computing state-action log-likelihoods under diffusion policies
Enabling KL-adaptive learning rates and entropy regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative diffusion policies in on-policy RL
Exact diffusion inversion for invertible action mappings
Doubled dummy action mechanism for invertibility
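The invertibility idea above can be illustrated with a generic additive coupling map over a doubled action, in the spirit the summary describes. This is a hedged sketch of the general alternating-update principle, not GenPO's exact construction: when each half of the doubled action is updated using only the other half, the map can be undone step by step, and for additive updates the Jacobian is triangular with unit diagonal, so the change-of-variables log-likelihood is exact.

```python
# Sketch: alternating (coupling-style) updates on a doubled action.
# Illustrative only; GenPO's actual inversion operates on the diffusion
# process itself.
import numpy as np

def forward(a, b, f=np.tanh):
    """Each half is shifted by a function of the other, so the map is
    exactly invertible; additive shifts give log|det J| = 0."""
    a = a + f(b)   # update first half from the (dummy) second half
    b = b + f(a)   # update second half from the new first half
    return a, b

def inverse(a, b, f=np.tanh):
    """Exact inversion: undo the updates in reverse order."""
    b = b - f(a)
    a = a - f(b)
    return a, b
```

A round trip (`inverse(*forward(a, b))`) recovers the inputs to machine precision, which is what makes exact log-likelihood computation possible for such mappings.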
Authors

Shutong Ding, ShanghaiTech University
Ke Hu, ShanghaiTech University
Shan Zhong, University of Electronic Science and Technology of China
Haoyang Luo, City University of Hong Kong, Hong Kong. Interests: Multimodal
Weinan Zhang, Shanghai Jiao Tong University
Jingya Wang, Assistant Professor, ShanghaiTech University. Interests: Computer Vision, Embodied AI, Human-Object Interaction
Jun Wang, University College London
Ye Shi, Assistant Professor, School of Information Science and Technology, ShanghaiTech University. Interests: Embodied AI, Generative Models, Optimization and Control, Vision Language Models