Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Downstream fine-tuned text-to-image diffusion models frequently revert to toxic behaviors acquired during pretraining, and existing safety-unlearning methods lack robustness in this setting. Method: The authors identify the root cause of this vulnerability and propose ResAlign, a resilient safety-driven unlearning framework. ResAlign models downstream fine-tuning as an implicit optimization problem via a Moreau envelope reformulation, enabling efficient gradient estimation, and pairs this with a meta-learning strategy that simulates diverse fine-tuning distributions so the unlearning generalizes across varied fine-tuning methods and configurations. Contribution/Results: Evaluated across multiple datasets and fine-tuning protocols, ResAlign significantly outperforms baselines, maintaining high safety after both benign and harmful fine-tuning without compromising benign generation quality, and its unlearning effect persists under the distributional shifts induced by downstream adaptation.
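
For reference, the Moreau envelope invoked above has the standard form below; the gradient identity is a textbook property of the envelope, and the paper's exact reformulation may differ in detail.

```latex
% Moreau envelope of a fine-tuning loss f at the unlearned weights \theta,
% with smoothing parameter \lambda > 0:
M_{\lambda} f(\theta) \;=\; \min_{\theta'} \; f(\theta') \;+\; \frac{1}{2\lambda}\,\lVert \theta' - \theta \rVert^2
% Its gradient admits the closed form
\nabla M_{\lambda} f(\theta) \;=\; \frac{1}{\lambda}\,\bigl(\theta - \operatorname{prox}_{\lambda f}(\theta)\bigr)
% where the proximal point prox_{\lambda f}(\theta), i.e. the minimizer above,
% stands in for the fine-tuned weights; this is what makes the fine-tuning
% step differentiable without unrolling its optimization trajectory.
```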

📝 Abstract
Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they prove fragile under downstream fine-tuning: we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy simulates a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability.
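
A minimal PyTorch sketch of how such a Moreau envelope-based gradient might be estimated is given below. This is an illustration under assumptions, not the authors' implementation: `finetune_loss` and `harmful_loss` are hypothetical callables standing in for the simulated fine-tuning objective and the harmfulness measure, and the proximal inner problem is solved approximately with a few SGD steps.

```python
import copy
import torch

def estimate_resilience_grad(model, finetune_loss, harmful_loss,
                             lam=0.1, inner_steps=5, inner_lr=1e-4):
    """Approximate the gradient of post-fine-tuning harmfulness w.r.t. the
    current (unlearned) weights theta, using a Moreau-envelope-style proximal
    inner problem as a stand-in for downstream fine-tuning.

    `finetune_loss` and `harmful_loss` are hypothetical callables mapping a
    model to a scalar loss; they are placeholders, not the paper's API.
    """
    theta = [p.detach().clone() for p in model.parameters()]

    # Inner problem: simulate fine-tuning as the proximal point
    # prox_{lam*f}(theta) = argmin_{theta'} f(theta') + ||theta'-theta||^2/(2*lam)
    sim = copy.deepcopy(model)
    opt = torch.optim.SGD(sim.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        prox = sum(((p - t) ** 2).sum() for p, t in zip(sim.parameters(), theta))
        loss = finetune_loss(sim) + prox / (2.0 * lam)
        loss.backward()
        opt.step()

    # Outer signal: harmfulness of the simulated fine-tuned model.
    sim.zero_grad()
    harmful_loss(sim).backward()

    # First-order hypergradient: read the gradient off at the (approximate)
    # proximal point and pull it back to theta.
    return [p.grad.detach().clone() if p.grad is not None else None
            for p in sim.parameters()]
```

Taking the gradient at the approximate proximal point as a first-order hypergradient avoids differentiating through the inner loop, which is what would make such an estimator efficient.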
Problem

Research questions and friction points this paper is trying to address.

Addressing unsafe behaviors in diffusion models from toxic data
Enhancing resilience of safety-driven unlearning against fine-tuning
Preventing recovery of harmful behaviors during downstream adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ResAlign: a safety-driven unlearning framework that stays resilient under downstream fine-tuning
Moreau envelope reformulation enables efficient gradient estimation through simulated fine-tuning
Meta-learning simulates diverse fine-tuning scenarios for generalization (see the sketch after this list)
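
The meta-learning bullet above can be pictured as an outer loop that samples a fine-tuning scenario each round and mixes the ordinary unlearning gradient with the resilience gradient from the previous sketch. Again a hypothetical skeleton, not the released ResAlign procedure: `configs`, `unlearn_loss`, and the weighting `alpha` are all assumptions.

```python
import random
import torch

def meta_unlearn(model, unlearn_loss, configs,
                 meta_lr=1e-5, rounds=1000, alpha=0.5):
    """Illustrative meta-learning outer loop: each round samples a
    fine-tuning scenario (dataset, method, hyperparameters) and adds a
    resilience term that penalizes recovery of harmful behavior after
    the simulated fine-tuning. All names here are assumptions.
    """
    opt = torch.optim.Adam(model.parameters(), lr=meta_lr)
    for _ in range(rounds):
        cfg = random.choice(configs)  # e.g. {"finetune_loss": ..., "harmful_loss": ...}

        opt.zero_grad()
        unlearn_loss(model).backward()  # standard safety-unlearning objective

        # Resilience term, via the Moreau-envelope estimator from the
        # previous sketch (estimate_resilience_grad defined there).
        res_grads = estimate_resilience_grad(
            model,
            finetune_loss=cfg["finetune_loss"],
            harmful_loss=cfg["harmful_loss"])
        for p, g in zip(model.parameters(), res_grads):
            if g is not None:
                p.grad = alpha * g if p.grad is None else p.grad + alpha * g

        opt.step()
```

Sampling a fresh scenario per round is what would let the unlearned weights generalize across fine-tuning methods and configurations rather than overfitting to a single simulated adaptation.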
👥 Authors
Boheng Li, Nanyang Technological University (AI Security · Watermarking · Backdoor Attack · Copyright Protection)
Renjie Gu, Central South University
Junjie Wang, Wuhan University
Leyi Qi, Zhejiang University
Yiming Li, Nanyang Technological University
Run Wang, Integrated Systems Laboratory (IIS), ETHz (Hardware/Software Co-design · TinyML)
Zhan Qin, Researcher, Zhejiang University (Data Security and Privacy · AI Security)
Tianwei Zhang, Nanyang Technological University