Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

๐Ÿ“… 2025-12-04
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Diffusion models generate high-likelihood samples but often align poorly with downstream objectives, and existing fine-tuning approaches frequently induce reward over-optimization, degrading sample naturalness and diversity. To address this, the paper proposes a KL-regularized reinforcement-learning fine-tuning framework based on reparameterized policy gradients of the soft Q-function. The method combines a training-free, differentiable soft Q-estimate, a discount factor for credit assignment across denoising steps, consistency models to refine Q-value estimates, and an off-policy replay buffer to improve sample efficiency. Crucially, it jointly optimizes the target reward while explicitly constraining the KL divergence between the fine-tuned distribution and the pre-trained prior, thereby balancing alignment, naturalness, and diversity. Experiments show substantial reward gains on text-to-image alignment and black-box optimization tasks while preserving high sample fidelity, broad mode coverage, and sample efficiency.

๐Ÿ“ Abstract
Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose **Soft Q-based Diffusion Finetuning (SQDF)**, a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
Problem

Research questions and friction points this paper is trying to address.

Aligning diffusion models with downstream objectives
Mitigating reward over-optimization during fine-tuning
Balancing reward achievement with sample diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

KL-regularized RL method for diffusion alignment
Reparameterized policy gradient of soft Q-function
Discount factor, consistency models, off-policy replay buffer
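To make the core idea concrete, here is a minimal one-dimensional toy sketch of a KL-regularized objective: a Monte Carlo soft-Q estimate of the target reward, computed through a reparameterized (hence differentiable) sample, traded off against the KL divergence from a pre-trained Gaussian prior. This is an illustrative assumption, not the paper's implementation; the reward function, Gaussian step distributions, and all names (`soft_q_estimate`, `sqdf_objective`, `beta`) are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x):
    # Toy downstream reward: prefers samples near 2.0
    # (a stand-in for e.g. an aesthetic or alignment score).
    return -np.square(x - 2.0)

def soft_q_estimate(mean, sigma, n=512):
    # Training-free soft-Q estimate: expected reward of samples drawn
    # via the reparameterization trick, x = mean + sigma * eps, so the
    # estimate is differentiable in the policy parameters (mean, sigma).
    eps = rng.standard_normal(n)
    x = mean + sigma * eps
    return reward(x).mean()

def kl_gaussians(mu_q, sig_q, mu_p, sig_p):
    # Closed-form KL(q || p) between two 1-D Gaussians: the regularizer
    # that keeps the fine-tuned step close to the pre-trained prior step.
    return np.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5

def sqdf_objective(mu_theta, mu_prior, sigma=1.0, beta=0.1):
    # KL-regularized objective: maximize the soft-Q estimate while
    # penalizing divergence from the prior. Small beta favors reward;
    # large beta favors staying close to the pre-trained distribution.
    return soft_q_estimate(mu_theta, sigma) - beta * kl_gaussians(mu_theta, sigma, mu_prior, sigma)
```

With a small `beta`, moving the mean toward the high-reward region wins; with a large `beta`, the KL term dominates and the prior mean scores higher, which is the reward-naturalness trade-off the summary describes.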