Step-level Reward for Free in RL-based T2I Diffusion Model Fine-tuning

📅 2025-05-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing reinforcement learning (RL)-based fine-tuning of text-to-image diffusion models suffers from sparse reward signals: each generation yields only a single delayed reward, which hinders step-level action attribution and makes training inefficient. To address this, we propose a model-free, architecture-agnostic dynamic dense reward allocation mechanism. Our approach introduces a step-wise credit assignment framework grounded in the change in cosine similarity between intermediate and final denoised images, augmented by reward shaping that emphasizes critical denoising steps. Without degrading the original policy's performance, our method improves sample efficiency by 1.25–2× and demonstrates superior generalization across four human preference-based reward functions. It mitigates both the inaccurate step-level attribution and the training inefficiency inherent in sparse-reward RL fine-tuning of diffusion models.

📝 Abstract
Recent advances in text-to-image (T2I) diffusion model fine-tuning leverage reinforcement learning (RL) to align generated images with learnable reward functions. Existing approaches reformulate denoising as a Markov decision process for RL-driven optimization. However, they suffer from reward sparsity, receiving only a single delayed reward per generated trajectory. This flaw hinders precise step-level attribution of denoising actions and undermines training efficiency. To address this, we propose a simple yet effective credit assignment framework that dynamically distributes dense rewards across denoising steps. Specifically, we track changes in cosine similarity between intermediate and final images to quantify each step's contribution to progressively reducing the distance to the final image. Our approach avoids auxiliary neural networks for step-level preference modeling and instead uses reward shaping to highlight denoising phases that have a greater impact on image quality. Our method achieves 1.25 to 2 times higher sample efficiency and better generalization across four human preference reward functions, without compromising the original optimal policy.
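
The mechanism described in the abstract lends itself to a short sketch. Below is a minimal, hypothetical PyTorch implementation of the cosine-similarity credit assignment; the function name `dense_step_rewards`, the shaping exponent `gamma`, and the clamping of negative contributions are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def dense_step_rewards(intermediate_preds: list[torch.Tensor],
                       final_image: torch.Tensor,
                       final_reward: float,
                       gamma: float = 2.0) -> torch.Tensor:
    """Redistribute one delayed trajectory reward over denoising steps.

    Each step is credited in proportion to how much it increases the
    cosine similarity between the intermediate prediction and the final
    image; `gamma` is a stand-in for the paper's reward shaping.
    """
    flat_final = final_image.flatten()
    # Cosine similarity of every intermediate prediction to the final image.
    sims = torch.stack([
        F.cosine_similarity(x.flatten(), flat_final, dim=0)
        for x in intermediate_preds
    ])
    # Per-step gain in similarity; steps that move away contribute nothing.
    deltas = torch.clamp(sims[1:] - sims[:-1], min=0.0)
    # Shaping: emphasize steps with outsized impact on the final image.
    shaped = deltas ** gamma
    weights = shaped / (shaped.sum() + 1e-8)
    # Dense per-step rewards that sum to the original scalar reward, in the
    # spirit of the paper's claim of not altering the original optimal policy.
    return final_reward * weights
```

In use, `intermediate_preds` would be the predicted clean images at each step (e.g., the x0 estimates a DDIM-style sampler exposes), and `final_reward` the score a preference model assigns to the finished image.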
Problem

Research questions and friction points this paper is trying to address.

Addresses reward sparsity in RL-based T2I diffusion fine-tuning
Proposes dynamic dense reward assignment for denoising steps
Enhances training efficiency without auxiliary neural networks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic dense reward distribution for denoising steps
Cosine similarity tracks step contributions to final image
Reward shaping highlights impactful denoising phases
Authors

Xinyao Liao, Huazhong University of Science and Technology
Wei Wei, Huazhong University of Science and Technology
Xiaoye Qu, Shanghai AI Lab
Yu Cheng, The Chinese University of Hong Kong