AI Summary
Diffusion models often yield high likelihoods but suffer from poor alignment with downstream objectives; existing fine-tuning approaches frequently induce reward over-optimization, degrading sample naturalness and diversity. To address this, we propose a KL-regularized reinforcement learning fine-tuning framework based on soft Q-function reparameterized policy gradients. Our method employs learnable soft Q-estimation, discount factor modeling, and consistency model-enhanced Q-value accuracy, coupled with an off-policy replay buffer to improve sample efficiency. Crucially, it jointly optimizes the target reward while explicitly constraining the KL divergence between the fine-tuned distribution and the pre-trained prior, thereby balancing alignment, naturalness, and diversity. Experiments demonstrate substantial improvements in reward scores on text-to-image alignment and black-box optimization tasks, while preserving high sample fidelity, broad mode coverage, and sample efficiency.
Abstract
Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models suffer significantly from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
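To make the KL-regularized objective concrete, here is a minimal toy sketch of the general idea the abstract describes: maximize expected reward minus a KL penalty that keeps the fine-tuned distribution close to the pre-trained prior. This is not the paper's SQDF implementation; it replaces the diffusion model with a 1-D Gaussian sampling policy (so the KL term has a closed form), and the reward function, `beta` weight, and all names are hypothetical illustrations.

```python
import math
import random

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ),
    # standing in for the diffusion-model KL to the pre-trained prior.
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

def kl_regularized_objective(mu, sigma, reward, beta, mu_pre, sigma_pre,
                             n=10_000, seed=0):
    # Monte-Carlo estimate of  E_{x ~ N(mu, sigma)}[reward(x)]
    # minus  beta * KL(fine-tuned || pre-trained).
    rng = random.Random(seed)
    exp_reward = sum(reward(rng.gauss(mu, sigma)) for _ in range(n)) / n
    return exp_reward - beta * kl_gaussians(mu, sigma, mu_pre, sigma_pre)

# Toy reward prefers samples near 2.0; the pre-trained prior is N(0, 1).
reward = lambda x: -(x - 2.0) ** 2

# A policy that moves toward the reward while staying broad ...
obj_aligned = kl_regularized_objective(
    mu=1.0, sigma=1.0, reward=reward, beta=1.0, mu_pre=0.0, sigma_pre=1.0)

# ... versus a collapsed, reward-hacking policy far from the prior.
obj_overopt = kl_regularized_objective(
    mu=2.0, sigma=0.1, reward=reward, beta=1.0, mu_pre=0.0, sigma_pre=1.0)
```

With a sufficiently large KL weight, the collapsed policy's large divergence from the prior outweighs its higher raw reward, so the broader policy scores better under the regularized objective; this is the mechanism by which the KL term discourages over-optimization while preserving naturalness and diversity.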