Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

📅 2024-05-31

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Offline reinforcement learning suffers from value overestimation on out-of-distribution actions, and KL-constrained policy iteration is difficult to optimize end to end. This paper reformulates the KL-regularized policy update as a differentiable diffusion noise regression task, so that the target policy can be parameterized directly as a diffusion model. It introduces a soft Q-gradient guidance term, combined with a Q-function ensemble and lower-confidence-bound (LCB) value targets, to preserve policy multimodality while improving training stability. The approach brings diffusion models, the actor-critic framework, and KL-constrained optimization into a single architecture. On the D4RL benchmark, it outperforms prior state-of-the-art methods on nearly all tasks.
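As a rough illustration of the LCB idea mentioned in the summary, the sketch below computes a pessimistic value from an ensemble of Q-estimates as the ensemble mean minus a multiple of the ensemble standard deviation. The function name `lcb_value` and the coefficient `rho` are hypothetical, and the choice of population standard deviation is an assumption; the paper's exact LCB definition is not given in this summary.

```python
import statistics


def lcb_value(q_values, rho=1.0):
    """Lower confidence bound over an ensemble of Q-estimates.

    q_values: Q(s, a) estimates from each ensemble member (assumed input).
    rho: pessimism coefficient (hypothetical name; rho=0 gives the plain mean).
    """
    mean = statistics.fmean(q_values)
    # Population standard deviation across ensemble members (an assumption).
    std = statistics.pstdev(q_values)
    return mean - rho * std


if __name__ == "__main__":
    # Members that disagree yield a lower (more pessimistic) value target.
    print(lcb_value([1.0, 3.0], rho=1.0))
    print(lcb_value([2.0, 2.0, 2.0], rho=1.0))
```

When the ensemble members agree, the bound equals their common value; the more they disagree, as they tend to on out-of-distribution actions, the lower the resulting target.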

📝 Abstract

In offline reinforcement learning, it is necessary to manage out-of-distribution actions to prevent overestimation of value functions. One class of methods, the policy-regularized method, addresses this problem by constraining the target policy to stay close to the behavior policy. Although several approaches suggest representing the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC), which formulates Kullback-Leibler (KL) constrained policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm, in which we alternately train a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term from the Q-gradient. The soft Q-guidance is based on the theoretical solution of the KL-constrained policy iteration, which prevents the learned policy from taking out-of-distribution actions. We demonstrate that this diffusion-based policy constraint, coupled with the lower confidence bound of the Q-ensemble as the value target, not only preserves the multi-modality of target policies but also contributes to stable convergence and strong performance in DAC. Our approach is evaluated on D4RL benchmarks and outperforms the state of the art in nearly all environments. Code is available at https://github.com/Fang-Lin93/DAC.
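For context, the KL-constrained policy improvement step referred to in the abstract has a well-known closed-form solution, and taking its score shows why a Q-gradient term appears as guidance. The notation here is assumed for illustration and may differ from the paper's: \pi_b is the behavior policy, \eta > 0 is the weight of the KL penalty, and Z(s) is a normalizing constant.

```latex
\begin{aligned}
\pi^{*} &= \arg\max_{\pi}\;
  \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ Q(s,a) \big]
  - \eta\, D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \pi_b(\cdot \mid s) \big), \\
\pi^{*}(a \mid s) &= \frac{1}{Z(s)}\, \pi_b(a \mid s)\, \exp\!\big( Q(s,a)/\eta \big), \\
\nabla_a \log \pi^{*}(a \mid s) &= \nabla_a \log \pi_b(a \mid s) + \frac{1}{\eta}\, \nabla_a Q(s,a).
\end{aligned}
```

The last line says that the score of the optimal policy is the behavior score shifted by the scaled Q-gradient. Because a diffusion model's noise prediction is a rescaled score, this shift can be expressed as a change to the noise regression target. As \eta grows, the shift vanishes and the policy reduces to the behavior policy.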

Problem

Research questions and friction points this paper is trying to address.

Managing out-of-distribution actions in offline RL to prevent value overestimation.

Formulating KL-constrained policy iteration as a diffusion noise regression problem.

Improving policy performance and training stability with diffusion-modeled target policies.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion noise regression policy

Actor-critic learning paradigm

Soft Q-guidance term
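The three ideas above can be combined in a toy sketch: noise a dataset action, shift the noise target against the Q-gradient, and regress the predicted noise onto that target. This is a generic guidance-style construction under stated assumptions, not DAC's exact loss. All function names, `alpha_bar` (the cumulative noise-schedule product at one timestep), and `eta` (the KL temperature) are hypothetical, and the Q-gradient is passed in directly instead of being computed from a critic.

```python
import math


def noise_action(a0, alpha_bar, eps):
    """Forward diffusion: a_t = sqrt(alpha_bar) * a0 + sqrt(1 - alpha_bar) * eps."""
    s0, s1 = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [s0 * a + s1 * e for a, e in zip(a0, eps)]


def guided_noise_target(eps, q_grad, alpha_bar, eta):
    """Shift the noise target against the Q-gradient.

    Uses score = -eps / sqrt(1 - alpha_bar): adding q_grad / eta to the score
    subtracts sqrt(1 - alpha_bar) / eta * q_grad from the noise.
    """
    scale = math.sqrt(1.0 - alpha_bar) / eta
    return [e - scale * g for e, g in zip(eps, q_grad)]


def regression_loss(eps_pred, eps_target):
    """Mean squared error between predicted and target noise."""
    return sum((p - t) ** 2 for p, t in zip(eps_pred, eps_target)) / len(eps_pred)


if __name__ == "__main__":
    a0, eps = [2.0, 0.0], [1.0, -1.0]      # dataset action and sampled noise
    a_t = noise_action(a0, alpha_bar=0.75, eps=eps)
    q_grad = [2.0, 0.0]                    # stand-in for grad_a Q(s, a_t)
    target = guided_noise_target(eps, q_grad, alpha_bar=0.75, eta=1.0)
    eps_pred = [0.5, -0.5]                 # stand-in for the actor's noise prediction
    print(a_t, target, regression_loss(eps_pred, target))
```

With a very large `eta` the target reduces to the sampled noise, which recovers plain behavior cloning of the dataset actions; smaller values push denoised actions further toward higher Q-values.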
