🤖 AI Summary
In online reinforcement learning, training diffusion-based policies is unstable and costly: standard diffusion training requires samples from the optimal policy, which are unavailable online, while the alternative of backpropagating policy gradients through the stochastic diffusion steps is expensive and unstable. This work proposes a lightweight online training framework that, for the first time, formulates diffusion policies as noise-perturbed energy-based models (EBMs) with the Q-function serving as the energy function. By leveraging denoising score matching, the method optimizes the policy directly, without sampling from the optimal policy or propagating gradients through the diffusion process, bypassing both bottlenecks. The approach tightly integrates diffusion modeling, Soft Actor-Critic (SAC), and EBM principles into a fully differentiable, end-to-end trainable architecture. Evaluated on the MuJoCo benchmark, it consistently outperforms existing online diffusion-policy methods. Notably, it achieves over 120% performance gains over SAC on the Humanoid and Ant tasks, substantially improving the practicality and scalability of diffusion policies in online RL settings.
📝 Abstract
Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) thanks to their rich expressiveness. However, the vanilla diffusion training procedure requires samples from the target distribution, which is impossible in online RL because we cannot sample from the optimal policy; this makes training diffusion policies highly non-trivial in online RL. Backpropagating the policy gradient through the diffusion process incurs large computational costs and instability, making it expensive and impractical. To enable efficient diffusion-policy training for online RL, we propose Soft Diffusion Actor-Critic (SDAC), which exploits the viewpoint of diffusion models as noise-perturbed energy-based models. SDAC relies solely on the state-action value function as the energy function to train diffusion policies, bypassing sampling from the optimal policy while keeping computations lightweight. We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that SDAC outperforms recent online diffusion-policy methods on most tasks, and improves by more than 120% over Soft Actor-Critic on complex locomotion tasks such as Humanoid and Ant.
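The Q-as-energy idea behind this can be illustrated with a minimal sketch. Below, a toy linear score model is fit with a denoising score-matching loss: buffer actions are perturbed with Gaussian noise, the model regresses onto the perturbation kernel's known score, and each sample is reweighted by exp(Q/α) so the Q-function plays the role of the energy. The `q_value` function, batch shapes, linear score model, and the exact reweighting scheme are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 0.2, 0.5            # temperature and perturbation noise level

def q_value(actions):
    # Hypothetical stand-in for a learned Q(s, a); peaks at the origin.
    return -np.sum(actions ** 2, axis=-1)

def dsm_loss(theta, actions):
    """Q-weighted denoising score matching (illustrative sketch).

    Perturb actions with Gaussian noise and regress a linear score model
    a_noisy @ theta onto the perturbation kernel's score, -eps / sigma,
    weighting samples by normalized exp(Q / alpha) so that Q acts as the
    (negative) energy of the target Boltzmann distribution.
    """
    eps = rng.normal(size=actions.shape)
    noisy = actions + sigma * eps             # forward perturbation step
    target = -eps / sigma                     # known score of the kernel
    pred = noisy @ theta                      # toy score network
    w = np.exp(q_value(actions) / alpha)
    w /= w.sum()                              # normalized Boltzmann weights
    per_sample = np.sum((pred - target) ** 2, axis=-1)
    return float(np.sum(w * per_sample))

actions = rng.normal(size=(256, 2))           # stand-in replay-buffer batch
theta = np.zeros((2, 2))                      # score-model parameters
loss = dsm_loss(theta, actions)
```

Because the regression targets come only from the perturbation noise and the weights only from Q, no samples from the optimal policy and no gradients through the diffusion chain are needed, which is the bottleneck the abstract describes.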