🤖 AI Summary
Problem: Text-to-image diffusion models frequently lose their safety alignment after benign fine-tuning (e.g., LoRA-based personalization or style adaptation), yet existing evaluation frameworks do not systematically assess this deployment-stage scenario. Method: We conduct the first systematic investigation into the frequent breakdown of mainstream safety alignment techniques under such post-deployment fine-tuning, and propose SPQR, a unified single-score benchmark that supports multilingual, cross-domain, and out-of-distribution evaluation. SPQR jointly quantifies safety compliance, prompt adherence, image quality, and robustness, complemented by fine-grained category-wise breakdowns, domain-specific analyses, and combined quantitative and qualitative evaluation. Contribution/Results: SPQR provides reproducible, standardized rankings across methods, making it easier to verify whether safety-aligned models remain safe in real-world use. It is the first benchmark to holistically address alignment degradation under practical fine-tuning conditions.
📝 Abstract
Text-to-image (T2I) diffusion models can generate copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under the benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. Because true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety-Prompt adherence-Quality-Robustness). SPQR is a single-score metric that provides a standardized and reproducible framework for evaluating how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, summarized as a single leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, ultimately showcasing SPQR as a concise yet comprehensive benchmark for T2I safety alignment techniques.
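The abstract describes SPQR as condensing four axes (safety, prompt adherence, quality, robustness) into one leaderboard score, but it does not state the aggregation rule. The following is a hypothetical illustration only: it assumes each axis has already been normalized to [0, 1] and combines them with a geometric mean. The names `AxisScores` and `single_score` are invented for this sketch and are not SPQR's published formula or code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    """Hypothetical per-axis scores, each assumed to be normalized to [0, 1]."""
    safety: float
    prompt_adherence: float
    quality: float
    robustness: float

    def as_list(self) -> list[float]:
        return [self.safety, self.prompt_adherence, self.quality, self.robustness]


def single_score(scores: AxisScores) -> float:
    """Combine the four axis scores into one number via a geometric mean.

    This aggregation rule is an illustrative assumption, not SPQR's formula.
    """
    values = scores.as_list()
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"axis score {v} is outside [0, 1]")
    # Geometric mean: an axis at 0 (e.g., collapsed safety) forces the result to 0.
    return math.prod(values) ** (1.0 / len(values))
```

A geometric mean is one plausible choice here because a collapse on any single axis, such as safety dropping toward zero after benign fine-tuning, pulls the combined score down sharply, whereas an arithmetic mean would let strong quality or prompt adherence mask that failure.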