🤖 AI Summary
Downstream fine-tuning of text-to-image diffusion models frequently reactivates toxic behaviors acquired during pretraining, and existing safety-driven unlearning methods are not robust in this setting. Method: ResAlign models downstream fine-tuning as an implicit optimization problem via a Moreau-envelope reformulation, enabling efficient gradient estimation to minimize the recovery of harmful behaviors. A meta-learning strategy additionally simulates a diverse distribution of fine-tuning scenarios, helping the method generalize across varied fine-tuning approaches and configurations. Contribution/Results: Evaluated across multiple datasets, fine-tuning methods, and configurations, ResAlign consistently outperforms prior unlearning approaches, retaining safety after both benign and harmful fine-tuning while preserving benign generation quality.
📝 Abstract
Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, we find them to be fragile to downstream fine-tuning: state-of-the-art methods largely fail to retain their effectiveness even when the model is fine-tuned on entirely benign datasets. To mitigate this problem, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy simulates a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability.
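To make the Moreau-envelope idea concrete, here is a minimal numerical sketch, not the paper's actual algorithm: downstream fine-tuning from unlearned weights θ is modeled as the proximal problem w*(θ) = argmin_w L_ft(w) + ||w − θ||²/(2λ), and the envelope's gradient identity ∇M(θ) = (θ − w*)/λ gives a cheap gradient estimate without backpropagating through the fine-tuning trajectory. The toy quadratic fine-tuning loss, λ, and all step counts below are illustrative assumptions; the real L_ft would be a diffusion training loss, and ResAlign's full outer objective (penalizing recovery of harmful behaviors at w*) and meta-learned sampling of fine-tuning configurations are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic stand-in for the downstream fine-tuning loss L_ft (assumption:
# the actual loss in the paper is a diffusion training objective).
A = np.diag([1.0, 2.0, 0.5, 1.5])
b = rng.normal(size=4)

def L_ft(w):
    return 0.5 * w @ A @ w - b @ w

def grad_L_ft(w):
    return A @ w - b

lam = 0.5  # illustrative proximal weight

def prox(theta, steps=500, lr=0.05):
    """Simulate fine-tuning as the proximal point of the unlearned weights:
    w*(theta) = argmin_w L_ft(w) + ||w - theta||^2 / (2*lam)."""
    w = theta.copy()
    for _ in range(steps):
        w -= lr * (grad_L_ft(w) + (w - theta) / lam)
    return w

def moreau(theta):
    """Moreau envelope M(theta) of L_ft evaluated via the prox solution."""
    w = prox(theta)
    return L_ft(w) + np.sum((w - theta) ** 2) / (2 * lam)

theta = rng.normal(size=4)
w_star = prox(theta)

# Key identity enabling efficient gradient estimation:
# grad M(theta) = (theta - w*) / lam  -- no unrolling of fine-tuning needed.
g_envelope = (theta - w_star) / lam

# Sanity check against central finite differences of M.
eps = 1e-5
g_fd = np.array([
    (moreau(theta + eps * e) - moreau(theta - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(float(np.max(np.abs(g_envelope - g_fd))))
```

In the full method, the outer update would use such gradient estimates to steer θ so that the *fine-tuned* model w*(θ) does not recover harmful behaviors, and the meta-learning component would resample the simulated fine-tuning configuration (loss, steps, strength) each outer iteration to cover diverse downstream scenarios.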