🤖 AI Summary
Energy-based policies, where the negative Q-function serves as the energy function, are hindered in continuous control by the intractability of exact sampling from the associated Boltzmann distribution. Method: The paper proposes Diffusion Q-Sampling (DQS), an actor-critic method that uses an energy-guided denoising diffusion sampler to draw actions from the Boltzmann policy, bypassing reliance on simple parametric distributions (e.g., Gaussians) and enabling efficient, stable generation of complex, multimodal action distributions. Technically, it combines maximum-entropy RL, an actor-critic architecture, and an energy-function-guided diffusion process to maintain training stability while improving policy expressiveness and exploration. Contribution/Results: Experiments on standard continuous-control benchmarks demonstrate improved convergence stability and superior performance over mainstream Gaussian-parameterized policies, addressing the expressive limitations of conventional policy representations.
📝 Abstract
Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation, limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.
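The central object in the abstract is the Boltzmann policy π(a|s) ∝ exp(Q(s,a)/α), which a unimodal Gaussian cannot represent when Q has several good action modes. The sketch below is a simplified, hypothetical stand-in for the paper's denoising sampler: it uses unadjusted Langevin dynamics guided by ∇_a Q on a toy one-dimensional, bimodal Q-function (the `toy_q` function and all hyperparameters are illustrative assumptions, not taken from the paper). The point it demonstrates is that gradient-guided stochastic sampling recovers both action modes, which a single Gaussian policy would collapse onto one.

```python
import numpy as np

def toy_q(a):
    # Hypothetical bimodal Q-function over a 1-D action:
    # two equally good action modes at a = -1 and a = +1.
    return -np.minimum((a - 1.0) ** 2, (a + 1.0) ** 2)

def grad_q(a, eps=1e-4):
    # Finite-difference gradient of Q w.r.t. the action
    # (a learned critic would supply this via autodiff).
    return (toy_q(a + eps) - toy_q(a - eps)) / (2.0 * eps)

def langevin_sample(n_samples=2000, n_steps=200, step=0.05, alpha=0.5, seed=0):
    """Draw samples from pi(a) ∝ exp(Q(a)/alpha) with unadjusted Langevin dynamics.

    This is a stand-in for the paper's diffusion sampler: both follow
    noisy gradient steps on the energy -Q/alpha toward high-value actions.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 2.0, size=n_samples)  # start from a broad prior
    for _ in range(n_steps):
        noise = rng.normal(size=n_samples)
        # Gradient ascent on Q/alpha plus injected noise keeps the chain
        # distributed (approximately) as the Boltzmann target.
        a = a + step * grad_q(a) / alpha + np.sqrt(2.0 * step) * noise
    return a

samples = langevin_sample()
# Fraction of samples in the right-hand mode; near 0.5 indicates
# the sampler covers both modes rather than collapsing to one.
frac_pos = float(np.mean(samples > 0))
```

Note this toy uses Langevin dynamics rather than a learned reverse-diffusion process; it illustrates only the shared principle of energy-gradient-guided sampling from the soft-Q Boltzmann distribution.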