🤖 AI Summary
This work addresses the challenge of fine-tuning text-to-image diffusion models under black-box objectives, where existing reinforcement learning methods—such as PPO and REINFORCE—suffer from an inherent trade-off among hyperparameter sensitivity, computational cost, and sample efficiency. To overcome this, we propose Leave-One-Out PPO (LOOP), the first algorithm to synergistically integrate REINFORCE’s variance-reduction mechanism with PPO’s robust sampling strategy. LOOP further incorporates importance sampling, action clipping, multi-action sampling, and baseline correction. Extensive experiments demonstrate that LOOP significantly improves generation quality across multiple black-box evaluation objectives, reduces training memory consumption by 40%, accelerates convergence by 2.1×, and matches or exceeds PPO’s performance while exhibiting markedly stronger hyperparameter robustness.
📝 Abstract
Reinforcement learning ( RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is the most popular choice of method for policy optimization. While effective in terms of performance, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some computational complexities such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high-variance and sample inefficiency. While the variance of the REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO ( LOOP), a novel RL for diffusion fine-tuning method. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.