🤖 AI Summary
This work addresses unsafe content generation in diffusion models pretrained on large-scale data by proposing an online reinforcement learning framework that needs neither supervised safety data nor a dedicated reward model. Building on Group Relative Policy Optimization (GRPO), the method derives a steering reward from directional cues in the CLIP embedding space, enabling online post-training on both positive and negative text prompts. The resulting safety alignment mechanism requires no paired training data and mitigates catastrophic forgetting. Experiments show a substantial reduction in unsafe generations: the proportion of inappropriate content falls from 48.9% (SD v1.4) to 18.07% and detected nudity instances fall from 646 to 15, while GenEval compositional generation quality improves from 42.08% to 47.83%. The safety gains generalize to out-of-domain unsafe prompts across seven harm categories, where the approach achieves state-of-the-art performance.
📝 Abstract
Removing unsafe content that diffusion models learn during pre-training has been widely studied. Existing methods require expensive supervised data, either unsafe text paired with safe-image ground truth or negative/positive image pairs, which makes them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, which degrades generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: text representations can be steered toward positive safety directions and away from negative ones in the embedding space. Our online approach enables the model to learn from diverse prompts, including explicitly unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 for the baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
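The abstract does not give the exact form of the steering reward, so the following is only a minimal sketch of how a direction-based reward and a GRPO-style group normalization could fit together. It assumes the reward is the cosine similarity of a generated image's embedding to a positive (safe) text embedding minus its similarity to a negative (unsafe) one, and that advantages are the rewards standardized within a group of samples for the same prompt. The function names `steering_reward` and `group_relative_advantages` are hypothetical, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def steering_reward(image_emb, pos_text_emb, neg_text_emb):
    """Assumed reward: cos(image, positive text) - cos(image, negative text).

    image_emb:    (G, D) embeddings of G images sampled for one prompt
    pos_text_emb: (D,) embedding of the positive (safe) text
    neg_text_emb: (D,) embedding of the negative (unsafe) text
    Returns a (G,) array of rewards in [-2, 2].
    """
    img = l2_normalize(image_emb)
    pos = l2_normalize(pos_text_emb)
    neg = l2_normalize(neg_text_emb)
    return img @ pos - img @ neg


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sample group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


if __name__ == "__main__":
    # Toy stand-ins for CLIP embeddings (assumption: real ones would come
    # from a CLIP image/text encoder, which is not shown here).
    rng = np.random.default_rng(0)
    group_size, dim = 4, 16
    image_emb = rng.normal(size=(group_size, dim))
    pos_text_emb = rng.normal(size=dim)
    neg_text_emb = rng.normal(size=dim)

    rewards = steering_reward(image_emb, pos_text_emb, neg_text_emb)
    advantages = group_relative_advantages(rewards)
    print("rewards:", rewards)
    print("advantages:", advantages)
```

Under these assumptions, samples whose embeddings lie closer to the safe direction than their group peers receive positive advantages, so no separately trained safety reward model is needed to rank them.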