🤖 AI Summary
This study investigates how the proportion of unsafe images in training data affects the safety of text-to-image generative models. The authors construct controlled datasets and train multiple model variants with contamination rates from 0% to 9.6% while holding other factors constant. They evaluate output safety with four independent safety classifiers, ablate the text encoder (including SafeCLIP), and measure generation quality with FID, CLIPScore, and ImageReward. The results show a monotonic dose–response relationship between training contamination level and output unsafety. Even with zero contamination, a baseline unsafe-output rate of 16.6% persists, which SafeCLIP reduces to 9.6%; filtering unsafe training data causes no measurable loss in generation quality. These findings implicate the text encoder as a residual safety risk and indicate that data curation and text encoder safety are complementary interventions.
📝 Abstract
Text-to-image models trained on large-scale data inevitably ingest unsafe content. While prior work has observed input-output amplification, it remains unclear whether training data composition directly drives model output safety or whether other factors are responsible. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ \emph{only} in their fraction of unsafe images (0\% to 9.6\%), across several dataset scales (100K to 8M). We then generate images with the resulting models and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6\% at 0\% contamination to 25.5\% at 5\%. A factorial design reveals that the \emph{proportion}, not the absolute count, of unsafe training images is the operative variable. The 16.6\% irreducible baseline at zero contamination implicates the other components, e.g., the frozen text encoder, as a residual safety risk. A text encoder ablation confirms this: SafeCLIP reduces this floor to 9.6\%, while the dose-response effect persists across all three encoders tested. Critically, safety filtering causes no quality degradation in terms of FID, CLIPScore, and ImageReward. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety raises questions for future research about emerging capabilities and compositionality.
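To make the factorial design concrete, the following is a minimal sketch of how training splits with a controlled unsafe fraction could be sampled, and how dataset size and contamination rate can be crossed so that proportion and absolute count vary independently. It is an illustration under stated assumptions, not the authors' pipeline: the function names `build_contaminated_split` and `factorial_grid`, the ID pools, and the example sizes and fractions are all hypothetical.

```python
import random


def build_contaminated_split(safe_ids, unsafe_ids, total_size, unsafe_fraction, seed=0):
    """Sample `total_size` training IDs containing a fixed fraction of unsafe items.

    Hypothetical helper: `safe_ids` and `unsafe_ids` are lists of pre-labelled
    example IDs; how the labels were obtained is outside this sketch.
    """
    if not 0.0 <= unsafe_fraction <= 1.0:
        raise ValueError("unsafe_fraction must lie in [0, 1]")
    n_unsafe = round(total_size * unsafe_fraction)
    n_safe = total_size - n_unsafe
    if n_unsafe > len(unsafe_ids) or n_safe > len(safe_ids):
        raise ValueError("ID pools are too small for the requested split")
    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle the mixture.
    split = rng.sample(safe_ids, n_safe) + rng.sample(unsafe_ids, n_unsafe)
    rng.shuffle(split)
    return split


def factorial_grid(sizes, fractions):
    """Cross every dataset size with every contamination fraction.

    Returns (size, fraction, unsafe_count) triples, so cells with the same
    unsafe count but different proportions can be compared directly.
    """
    return [(size, frac, round(size * frac)) for size in sizes for frac in fractions]


if __name__ == "__main__":
    # Illustrative values only; the actual experimental grid is not given here.
    for size, frac, count in factorial_grid([100_000, 1_000_000], [0.005, 0.05]):
        print(f"size={size:>9,}  fraction={frac:.3f}  unsafe_count={count:,}")
```

In this illustrative grid, the cell with 100,000 examples at 5% and the cell with 1,000,000 examples at 0.5% contain the same number of unsafe examples but differ in proportion. Comparing cells of this kind is what lets a factorial design separate the effect of proportion from that of absolute count.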