Learning a Diffusion Model Policy from Rewards via Q-Score Matching

📅 2023-12-18
🏛️ International Conference on Machine Learning
📈 Citations: 14
Influential: 4
🤖 AI Summary
Existing diffusion-based policies in reinforcement learning are trained with behavior-cloning-style objectives that fail to exploit the structural properties of their score functions, limiting expressivity and making training inefficient. This work introduces Q-score matching, the first method to establish a theoretical equivalence between the score function of a diffusion policy and the action gradient of the Q-function. Policy updates therefore require differentiating only the denoising model, eliminating end-to-end backpropagation through the full diffusion sampling chain. The approach yields implicitly multimodal, exploration-aware policies and significantly improves training efficiency and stability. Evaluated on continuous-control benchmark tasks, the method outperforms popular off-policy RL baselines, demonstrating strong robustness, good convergence properties, and the ability to model multimodal behavioral distributions.
📝 Abstract
Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning. This is due to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models, and instead utilize a simple behavior cloning term to train the actor, limiting their ability in the actor-critic setting. In this paper, we present a theoretical framework linking the structure of diffusion model policies to a learned Q-function, by linking the structure between the score of the policy to the action gradient of the Q-function. We focus on off-policy reinforcement learning and propose a new policy update method from this theory, which we denote Q-score matching. Notably, this algorithm only needs to differentiate through the denoising model rather than the entire diffusion model evaluation, and converged policies through Q-score matching are implicitly multi-modal and explorative in continuous domains. We conduct experiments in simulated environments to demonstrate the viability of our proposed method and compare to popular baselines. Source code is available from the project website: https://michaelpsenka.io/qsm.
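The update described above can be sketched in a few lines: regress the score (denoising) model of the diffusion policy onto the action gradient of a learned Q-function, so only the denoising network is differentiated. The sketch below is a minimal toy illustration under our own assumptions, not the paper's implementation: the critic is a hand-built quadratic `Q(s, a) = -||a - W s||^2` with a closed-form action gradient, the score model is linear, and the diffusion time argument is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 3, 2

# Toy critic Q(s, a) = -||a - W s||^2 with a closed-form action gradient.
# W is a hypothetical fixed map used only for illustration.
W = rng.normal(size=(ACTION_DIM, STATE_DIM))

def q_action_grad(s, a):
    """grad_a Q(s, a) = -2 (a - W s): the Q-score-matching regression target."""
    return -2.0 * (a - s @ W.T)

# Linear stand-in for the denoising/score model s_theta(s, a).
A = rng.normal(size=(STATE_DIM + ACTION_DIM, ACTION_DIM)) * 0.1
b = np.zeros(ACTION_DIM)

def score(s, a):
    return np.concatenate([s, a], axis=-1) @ A + b

def qsm_loss(s, a):
    """Q-score matching loss: ||s_theta(s, a) - grad_a Q(s, a)||^2."""
    return np.mean(np.sum((score(s, a) - q_action_grad(s, a)) ** 2, axis=-1))

# Plain gradient descent on the score model's parameters only;
# the critic (and hence its action gradient) is held fixed.
s = rng.normal(size=(128, STATE_DIM))
a = rng.normal(size=(128, ACTION_DIM))
x = np.concatenate([s, a], axis=-1)
target = q_action_grad(s, a)
initial = qsm_loss(s, a)
lr = 0.01
for _ in range(500):
    resid = score(s, a) - target              # (N, ACTION_DIM)
    A -= lr * 2.0 * x.T @ resid / len(x)      # d loss / dA
    b -= lr * 2.0 * resid.mean(axis=0)        # d loss / db
final = qsm_loss(s, a)
print(initial, final)
```

After fitting, the score model points noised actions toward higher-Q regions, which is how the converged policy becomes implicitly multimodal: it follows the critic's gradient field rather than imitating a single demonstrated mode.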
Problem

Research questions and friction points this paper is trying to address.

Link diffusion model policies to Q-function
Improve policy update in off-policy reinforcement learning
Introduce Q-score matching for efficient training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion model policy
Q-score matching
Off-policy reinforcement learning
Michael Psenka
PhD Student, EECS, UC Berkeley
deep learning · artificial intelligence · geometry
Alejandro Escontrela
Department of Electrical Engineering and Computer Science, University of California, Berkeley
Pieter Abbeel
UC Berkeley | Covariant
Robotics · Machine Learning · AI
Yi Ma
Department of Electrical Engineering and Computer Science, University of California, Berkeley