Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation

📅 2026-01-18
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the collapse of generation diversity in reinforcement fine-tuning, where optimization dynamics often drive model outputs toward a single solution (i.e., a Dirac delta distribution) due to misalignment between the objective function and the optimization landscape. To mitigate this, we propose DRIFT, the first framework to systematically incorporate diversity incentives into reinforcement fine-tuning. DRIFT synergistically preserves both task alignment and output diversity during policy updates through reward-concentrated subset sampling, stochastic prompt augmentation, and potential-based reward shaping. Experimental results demonstrate that DRIFT achieves Pareto superiority: it improves generation diversity by 9.08%–43.46% while maintaining equivalent task alignment, or enhances task alignment by 59.65%–65.86% under comparable diversity levels.
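Of the three components, stochastic prompt augmentation is the simplest to illustrate: each training prompt is randomly perturbed so the policy conditions on a wider distribution of inputs. A minimal sketch, assuming a suffix-based augmentation (the style list and the suffix scheme are illustrative; the paper's actual augmentation may differ):

```python
import random

def augment_prompt(prompt, styles=None, rng=random):
    """Stochastic prompt augmentation: append a randomly chosen style
    suffix to expand the conditioning space during on-policy rollouts.
    The suffix list is purely illustrative."""
    styles = styles or [
        "in watercolor style",
        "at golden hour",
        "with dramatic lighting",
        "minimalist composition",
    ]
    return f"{prompt}, {rng.choice(styles)}"
```

Because each rollout sees a slightly different conditioning signal, the policy is less likely to lock onto a single high-reward mode for a fixed prompt.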

📝 Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning large-scale generative models, such as diffusion and flow models, to align with complex human preferences and user-specified tasks. A fundamental limitation remains the curse of diversity collapse, where the objective formulation and optimization landscape inherently collapse the policy to a Dirac delta distribution. To address this challenge, we propose DRIFT (DiveRsity-Incentivized Reinforcement Fine-Tuning for Versatile Image Generation), a framework that systematically incentivizes output diversity throughout on-policy fine-tuning, reconciling strong task alignment with high generation diversity to enhance the versatility essential for applications that demand diverse candidate generations. We approach the problem from three representative perspectives: i) sampling a reward-concentrated subset that filters out reward outliers to prevent premature collapse; ii) prompting with stochastic variations to expand the conditioning space; and iii) optimizing intra-group diversity with a potential-based reward shaping mechanism. Experimental results show that DRIFT achieves Pareto dominance with respect to task alignment and generation diversity, yielding a 9.08%–43.46% increase in diversity at equivalent alignment levels and a 59.65%–65.86% increase in alignment at equivalent diversity levels.
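The sampling and optimization components described above can be sketched together: outlier filtering keeps a reward-concentrated subset of a rollout group, and a diversity potential computed over the group's sample embeddings is added to the task reward. A minimal NumPy sketch, assuming an IQR outlier rule, mean pairwise L2 distance as the diversity potential, and an illustrative shaping weight `beta` (none of these specifics are taken from the paper):

```python
import numpy as np

def reward_concentrated_subset(rewards, k=1.5):
    """Keep indices of samples whose rewards fall inside the IQR fence,
    filtering reward outliers. The IQR rule is an assumed criterion,
    not the paper's exact one."""
    q1, q3 = np.percentile(rewards, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return np.where((rewards >= lo) & (rewards <= hi))[0]

def intra_group_diversity(embeddings):
    """Mean pairwise L2 distance between sample embeddings in a group;
    larger means the group's outputs are more spread out."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists.sum() / (n * (n - 1))

def shaped_rewards(rewards, embeddings, beta=0.1):
    """Add a diversity potential term to the task rewards so policy
    updates also credit spread-out groups (beta is illustrative)."""
    phi = intra_group_diversity(embeddings)
    return rewards + beta * phi
```

The subset filter prevents a single extreme-reward sample from dominating the policy gradient, while the shaping term keeps a gradient signal toward diverse groups even when task rewards are near-identical.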
Problem

Research questions and friction points this paper is trying to address.

diversity collapse
reinforcement fine-tuning
image generation
Dirac delta distribution
generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

diversity collapse
reinforcement fine-tuning
reward shaping
stochastic prompting
on-policy learning