DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

📅 2025-02-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Gaussian policies in Maximum Entropy Reinforcement Learning (MaxEnt-RL) have limited expressivity, while diffusion policies, though far more expressive, are hard to use within the MaxEnt framework because their marginal entropy is intractable to compute. Method: We use approximate inference with diffusion models to derive a lower bound on the maximum entropy objective, which makes the objective tractable to optimize. Building on this bound, we propose a policy iteration scheme that combines diffusion modeling, variational inference, and entropy regularization. Contribution/Results: The policy iteration scheme provably converges to the optimal diffusion policy, so the approach keeps the principled exploration benefits of MaxEnt-RL while using a high-capacity policy class. Empirically, on high-dimensional continuous control benchmarks, it is competitive with state-of-the-art non-diffusion RL methods and significantly outperforms existing diffusion-based RL approaches, while requiring fewer algorithmic design choices and smaller update-to-data ratios, which reduces computational cost.

📝 Abstract

Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges, primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion-based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.
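
For orientation, the maximum entropy objective and a generic auxiliary-variable entropy bound can be written as below. This is a sketch in standard notation and is not necessarily the paper's exact formulation: the temperature \alpha, the discount \gamma, the denoising chain a^{0:N} (with a^0 as the executed action), and the auxiliary distribution q are all assumed symbols introduced here for illustration.

```latex
% Standard MaxEnt-RL objective: reward plus entropy bonus with temperature \alpha
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}
  \big( r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big)\Big]

% A diffusion policy defines the action marginal only through an integral
% over the intermediate denoising steps, which is why its entropy is intractable
\pi(a^{0} \mid s) = \int \pi(a^{0:N} \mid s)\, \mathrm{d}a^{1:N}

% Generic lower bound on the marginal entropy, valid for any auxiliary q
\mathcal{H}(\pi(a^{0} \mid s)) \;\ge\;
  \mathbb{E}_{\pi(a^{0:N} \mid s)}\big[ \log q(a^{1:N} \mid a^{0}, s)
  - \log \pi(a^{0:N} \mid s) \big]
```

The gap in the last line is the expected KL divergence between \pi(a^{1:N} \mid a^{0}, s) and q(a^{1:N} \mid a^{0}, s), so it is non-negative and vanishes when q matches that conditional. Substituting such a bound for the entropy term in J(\pi) yields a lower bound on the maximum entropy objective that only involves quantities along the sampled chain.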

Problem

Research questions and friction points this paper is trying to address.

Enhancing policy representation in MaxEnt-RL

Addressing the intractability of computing the marginal entropy of diffusion policies

Improving exploration in high-dimensional control tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integration of diffusion-based policies into MaxEnt-RL

Lower bound on the maximum entropy objective

Convergent policy iteration scheme
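
The idea of replacing an intractable marginal entropy with a tractable lower bound can be checked numerically on a toy problem. The sketch below is not the DIME algorithm: it assumes a small discrete joint distribution over an "action" x and a latent z (standing in for the intermediate diffusion steps), and an arbitrary auxiliary distribution q(z | x). All names (`joint`, `q`, `entropy_lower_bound`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def marginal_entropy(joint):
    """Exact entropy of p(x), where joint[x, z] = p(x, z)."""
    px = joint.sum(axis=1)
    return float(-(px * np.log(px)).sum())


def entropy_lower_bound(joint, q):
    """Auxiliary-variable bound: E_{p(x,z)}[log q(z|x) - log p(x,z)].

    q[x, z] is any conditional distribution over z given x (rows sum to 1).
    """
    return float((joint * (np.log(q) - np.log(joint))).sum())


rng = np.random.default_rng(0)

# Toy joint p(x, z) with 3 "actions" and 4 latent values (assumed sizes).
joint = rng.random((3, 4)) + 0.1
joint /= joint.sum()

# An arbitrary auxiliary conditional q(z | x).
q = rng.random((3, 4)) + 0.1
q /= q.sum(axis=1, keepdims=True)

# The exact conditional p(z | x), for which the bound becomes tight.
posterior = joint / joint.sum(axis=1, keepdims=True)

h = marginal_entropy(joint)
loose = entropy_lower_bound(joint, q)
tight = entropy_lower_bound(joint, posterior)

print(f"exact entropy: {h:.4f}")
print(f"bound with arbitrary q: {loose:.4f}")
print(f"bound with exact conditional: {tight:.4f}")
```

With an arbitrary q the bound sits below the exact entropy, and it matches the exact entropy when q equals the true conditional p(z | x). In a diffusion policy the marginal cannot be enumerated like this, which is the motivation for optimizing a bound instead.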
