A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning

📅 2025-03-02

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

This work addresses the challenge of fine-tuning text-to-image diffusion models under black-box objectives, where existing reinforcement learning methods—such as PPO and REINFORCE—suffer from an inherent trade-off among hyperparameter sensitivity, computational cost, and sample efficiency. To overcome this, we propose Leave-One-Out PPO (LOOP), the first algorithm to synergistically integrate REINFORCE’s variance-reduction mechanism with PPO’s robust sampling strategy. LOOP further incorporates importance sampling, action clipping, multi-action sampling, and baseline correction. Extensive experiments demonstrate that LOOP significantly improves generation quality across multiple black-box evaluation objectives, reduces training memory consumption by 40%, accelerates convergence by 2.1×, and matches or exceeds PPO’s performance while exhibiting markedly stronger hyperparameter robustness.

Technology Category

Application Category

📝 Abstract

Reinforcement learning ( RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is the most popular choice of method for policy optimization. While effective in terms of performance, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some computational complexities such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high-variance and sample inefficiency. While the variance of the REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO ( LOOP), a novel RL for diffusion fine-tuning method. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.

Problem

Research questions and friction points this paper is trying to address.

Addresses hyper-parameter sensitivity in PPO for diffusion models.

Reduces computational overhead and memory usage in RL fine-tuning.

Improves sample efficiency and performance in text-to-image diffusion.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines REINFORCE and PPO for efficiency

Introduces LOOP with variance reduction techniques

Balances computational efficiency and model performance

🔎 Similar Papers

No similar papers found.