VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

career value

190K/year

🤖 AI Summary

Diffusion models struggle with non-differentiable and sparse reward signals in reinforcement learning (RL) settings. To address this, this paper proposes the first value-function-based RL fine-tuning framework for diffusion models. Our method integrates a learned value function into the diffusion process to estimate the expected cumulative reward along denoising trajectories, thereby providing dense, differentiable supervision. Coupled with KL-divergence regularization, the framework enables stable end-to-end training and accommodates non-differentiable reward functions. By unifying backpropagation-driven policy optimization with joint diffusion-RL training, our approach significantly improves both generation quality and training efficiency. On text-guided image editing and attribute-controlled generation tasks, it achieves state-of-the-art performance—surpassing prior methods in both PSNR and CLIP-Score—while accelerating training by 42%.

Technology Category

Application Category

📝 Abstract

Diffusion models have emerged as powerful generative tools across various domains, yet tailoring pre-trained models to exhibit specific desirable properties remains challenging. While reinforcement learning (RL) offers a promising solution,current methods struggle to simultaneously achieve stable, efficient fine-tuning and support non-differentiable rewards. Furthermore, their reliance on sparse rewards provides inadequate supervision during intermediate steps, often resulting in suboptimal generation quality. To address these limitations, dense and differentiable signals are required throughout the diffusion process. Hence, we propose VAlue-based Reinforced Diffusion (VARD): a novel approach that first learns a value function predicting expection of rewards from intermediate states, and subsequently uses this value function with KL regularization to provide dense supervision throughout the generation process. Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation. Experimental results demonstrate that our approach facilitates better trajectory guidance, improves training efficiency and extends the applicability of RL to diffusion models optimized for complex, non-differentiable reward functions.

Problem

Research questions and friction points this paper is trying to address.

Challenges in tailoring pre-trained diffusion models for specific properties

Limitations of current RL methods in stable, efficient fine-tuning

Need for dense supervision to improve generation quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses value-based RL for dense supervision

KL regularization ensures stable fine-tuning

Supports non-differentiable rewards efficiently

🔎 Similar Papers

No similar papers found.