🤖 AI Summary
Downstream fine-tuning of text-to-image diffusion models frequently reactivates toxic behaviors acquired during pretraining, and existing safety-driven unlearning methods are not robust in this setting. Method: ResAlign models downstream fine-tuning as an implicit optimization problem via a Moreau-envelope reformulation, enabling efficient gradient estimation to minimize the recovery of harmful behaviors. A meta-learning strategy additionally simulates a diverse distribution of fine-tuning scenarios, helping the method generalize across varied fine-tuning approaches and configurations. Contribution/Results: Evaluated across multiple datasets, fine-tuning methods, and configurations, ResAlign consistently outperforms prior unlearning approaches, retaining safety after both benign and harmful fine-tuning while preserving benign generation quality.
📝 Abstract
Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, we find them to be fragile to downstream fine-tuning: state-of-the-art methods largely fail to retain their effectiveness even when the model is fine-tuned on entirely benign datasets. To mitigate this problem, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy simulates a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability.
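To make the Moreau-envelope idea concrete, here is a minimal numerical sketch, not the paper's actual algorithm: downstream fine-tuning from unlearned weights θ is modeled as the proximal problem w*(θ) = argmin_w L_ft(w) + ||w − θ||²/(2λ), and the envelope's gradient identity ∇M(θ) = (θ − w*)/λ gives a cheap gradient estimate without backpropagating through the fine-tuning trajectory. The toy quadratic fine-tuning loss, λ, and all step counts below are illustrative assumptions; the real L_ft would be a diffusion training loss, and ResAlign's full outer objective (penalizing recovery of harmful behaviors at w*) and meta-learned sampling of fine-tuning configurations are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic stand-in for the downstream fine-tuning loss L_ft (assumption:
# the actual loss in the paper is a diffusion training objective).
A = np.diag([1.0, 2.0, 0.5, 1.5])
b = rng.normal(size=4)

def L_ft(w):
    return 0.5 * w @ A @ w - b @ w

def grad_L_ft(w):
    return A @ w - b

lam = 0.5  # illustrative proximal weight

def prox(theta, steps=500, lr=0.05):
    """Simulate fine-tuning as the proximal point of the unlearned weights:
    w*(theta) = argmin_w L_ft(w) + ||w - theta||^2 / (2*lam)."""
    w = theta.copy()
    for _ in range(steps):
        w -= lr * (grad_L_ft(w) + (w - theta) / lam)
    return w

def moreau(theta):
    """Moreau envelope M(theta) of L_ft evaluated via the prox solution."""
    w = prox(theta)
    return L_ft(w) + np.sum((w - theta) ** 2) / (2 * lam)

theta = rng.normal(size=4)
w_star = prox(theta)

# Key identity enabling efficient gradient estimation:
# grad M(theta) = (theta - w*) / lam  -- no unrolling of fine-tuning needed.
g_envelope = (theta - w_star) / lam

# Sanity check against central finite differences of M.
eps = 1e-5
g_fd = np.array([
    (moreau(theta + eps * e) - moreau(theta - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(float(np.max(np.abs(g_envelope - g_fd))))
```

In the full method, the outer update would use such gradient estimates to steer θ so that the *fine-tuned* model w*(θ) does not recover harmful behaviors, and the meta-learning component would resample the simulated fine-tuning configuration (loss, steps, strength) each outer iteration to cover diverse downstream scenarios.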