🤖 AI Summary
Energy-based policies, where the negative Q-function serves as the energy function, are hindered in continuous control by the intractability of exact sampling from the associated Boltzmann distribution. Method: The paper proposes Diffusion Q-Sampling (DQS), an actor-critic method that uses an energy-guided denoising diffusion sampler to draw actions from the Boltzmann policy, bypassing reliance on simple parametric distributions (e.g., Gaussians) and enabling efficient, stable generation of complex, multimodal action distributions. Technically, it combines maximum-entropy RL, an actor-critic architecture, and an energy-function-guided diffusion process to maintain training stability while improving policy expressiveness and exploration. Contribution/Results: Experiments on standard continuous-control benchmarks demonstrate improved convergence stability and superior performance over mainstream Gaussian-parameterized policies, addressing the expressive limitations of conventional policy representations.
📝 Abstract
Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation, limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.
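The central object in the abstract is the Boltzmann policy π(a|s) ∝ exp(Q(s,a)/α), which a unimodal Gaussian cannot represent when Q has several good action modes. The sketch below is a simplified, hypothetical stand-in for the paper's denoising sampler: it uses unadjusted Langevin dynamics guided by ∇_a Q on a toy one-dimensional, bimodal Q-function (the `toy_q` function and all hyperparameters are illustrative assumptions, not taken from the paper). The point it demonstrates is that gradient-guided stochastic sampling recovers both action modes, which a single Gaussian policy would collapse onto one.

```python
import numpy as np

def toy_q(a):
    # Hypothetical bimodal Q-function over a 1-D action:
    # two equally good action modes at a = -1 and a = +1.
    return -np.minimum((a - 1.0) ** 2, (a + 1.0) ** 2)

def grad_q(a, eps=1e-4):
    # Finite-difference gradient of Q w.r.t. the action
    # (a learned critic would supply this via autodiff).
    return (toy_q(a + eps) - toy_q(a - eps)) / (2.0 * eps)

def langevin_sample(n_samples=2000, n_steps=200, step=0.05, alpha=0.5, seed=0):
    """Draw samples from pi(a) ∝ exp(Q(a)/alpha) with unadjusted Langevin dynamics.

    This is a stand-in for the paper's diffusion sampler: both follow
    noisy gradient steps on the energy -Q/alpha toward high-value actions.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 2.0, size=n_samples)  # start from a broad prior
    for _ in range(n_steps):
        noise = rng.normal(size=n_samples)
        # Gradient ascent on Q/alpha plus injected noise keeps the chain
        # distributed (approximately) as the Boltzmann target.
        a = a + step * grad_q(a) / alpha + np.sqrt(2.0 * step) * noise
    return a

samples = langevin_sample()
# Fraction of samples in the right-hand mode; near 0.5 indicates
# the sampler covers both modes rather than collapsing to one.
frac_pos = float(np.mean(samples > 0))
```

Note this toy uses Langevin dynamics rather than a learned reverse-diffusion process; it illustrates only the shared principle of energy-gradient-guided sampling from the soft-Q Boltzmann distribution.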