Diffusion Policy through Conditional Proximal Policy Optimization

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion policies in reinforcement learning have been limited by the computational inefficiency of evaluating action log-likelihoods. This work proposes a novel training paradigm that aligns policy iteration with the diffusion process, enabling on-policy training of diffusion policies by evaluating only simple Gaussian probabilities—thereby circumventing the need to simulate the full denoising trajectory required by conventional approaches and substantially reducing computational and memory overhead. The method naturally supports entropy regularization and achieves stable training through conditional proximal policy optimization. Evaluated across multiple benchmark tasks in IsaacLab and MuJoCo Playground, the proposed approach not only attains superior performance but also generates multimodal policy behaviors.

📝 Abstract
Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential in modeling multi-modal behaviors, enabling more diverse and flexible action generation compared to the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge is the difficulty of computing action log-likelihood under the diffusion model. This greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process in the diffusion model, which can be memory- and computationally inefficient. To overcome this challenge, we propose a novel and efficient method to train a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability. This is achieved by aligning the policy iteration with the diffusion process, which is a distinct paradigm compared to previous work. Moreover, our formulation can naturally handle entropy regularization, which is often difficult to incorporate into diffusion policies. Experiments demonstrate that the proposed method produces multimodal policy behaviors and achieves superior performance on a variety of benchmark tasks in both IsaacLab and MuJoCo Playground.
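The abstract's central point, that each denoising step can be treated as a conditional Gaussian whose log-likelihood is available in closed form, so a PPO-style ratio never requires simulating the full reverse chain, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's actual algorithm; the function names and interfaces are hypothetical.

```python
import math
import numpy as np

def gaussian_log_prob(x, mean, std):
    """Closed-form log N(x; mean, diag(std^2)), summed over action dims."""
    return np.sum(-0.5 * ((x - mean) / std) ** 2
                  - np.log(std) - 0.5 * math.log(2 * math.pi), axis=-1)

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped surrogate on the likelihood ratio."""
    ratio = np.exp(logp_new - logp_old)
    return -np.mean(np.minimum(ratio * advantage,
                               np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage))

# Conventional route: log pi(a|s) under a diffusion policy accumulates
# log-probabilities over every step of the reverse denoising chain, so
# memory and compute grow with the chain length.
# Per-step route (the paradigm described above): each denoising transition
# a_k -> a_{k-1} is itself a conditional Gaussian policy, so one closed-form
# gaussian_log_prob evaluation suffices to form the PPO ratio.
```

The design point is that `gaussian_log_prob` is cheap and differentiable in one pass, whereas backpropagating through an entire denoising trajectory is what the paper identifies as the memory and compute bottleneck.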
Problem

Research questions and friction points this paper is trying to address.

diffusion policy
reinforcement learning
log-likelihood
on-policy learning
action generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Policy
On-policy Reinforcement Learning
Conditional Proximal Policy Optimization
Entropy Regularization
Multimodal Action Generation
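Because each policy-iteration step reduces to a diagonal Gaussian, the entropy regularization mentioned among the contributions also has a closed form. A minimal sketch, assuming a diagonal Gaussian step distribution and a simple entropy-bonus weight `beta` (both hypothetical choices, not taken from the paper):

```python
import math
import numpy as np

def diag_gaussian_entropy(std):
    """Closed-form entropy of N(mu, diag(std^2)):
    H = 0.5 * d * (1 + log(2*pi)) + sum(log std)."""
    d = std.shape[-1]
    return 0.5 * d * (1.0 + math.log(2 * math.pi)) + np.sum(np.log(std), axis=-1)

def regularized_loss(surrogate_loss, std, beta=0.01):
    """Subtract an entropy bonus from the surrogate loss; equivalently,
    add beta * H to the maximized objective to encourage exploration."""
    return surrogate_loss - beta * np.mean(diag_gaussian_entropy(std))
```

This is why the Gaussian per-step formulation "naturally handles" entropy regularization: no sampling-based entropy estimate over the full denoising chain is needed.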
🔎 Similar Papers
2024-07-16 · arXiv.org · Citations: 2

Ben Liu
Southern University of Science and Technology, Shenzhen, China; LimX Dynamics

Shunpeng Yang
Hong Kong University of Science and Technology, Hong Kong, China; LimX Dynamics

Hua Chen
Assistant Professor, ZJU-UIUC Institute; Co-founder, LimX Dynamics

Robotics · Embodied AI · Robot Learning · Reinforcement Learning · Control