🤖 AI Summary
Diffusion models are vulnerable to adversarial fine-tuning, in which malicious actors inject harmful concepts into the model without altering its architecture. Method: This paper proposes a gradient-aware immunization mechanism that formulates model immunity as a bilevel optimization problem: the upper level degrades the learnability of harmful concepts via representation-space noise injection and gradient perturbation, while the lower level preserves fidelity in safe content generation, all without access to malicious data and relying solely on the original training objective. Contribution/Results: Experiments demonstrate that the method significantly suppresses the model's capacity to relearn harmful content (average reduction of 87.3%) while maintaining competitive generation quality on safe content, achieving FID scores comparable to, or slightly better than, baselines (at most a 0.8-point increase). The approach substantially enhances robustness against adversarial fine-tuning, establishing a novel, data-free defense paradigm for diffusion models.
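To make the mechanism concrete, here is a minimal PyTorch sketch of one immunization step, assuming a hypothetical model interface (`model.encode(cond)` returning a conditioning embedding and `model.diffusion_loss(x, cond)` returning the standard denoising loss) and illustrative hyperparameters; it collapses the bilevel problem into a single weighted update, which the paper may solve differently.

```python
import torch

def gift_style_step(model, opt, safe_x, safe_cond, harm_cond, lam=0.1, sigma=0.05):
    """One sketched immunization step (interface and hyperparameters are
    illustrative assumptions, not the paper's actual implementation)."""
    # Lower level: preserve fidelity on safe content (original training objective).
    loss_safe = model.diffusion_loss(safe_x, model.encode(safe_cond))

    # Upper level: degrade learnability of the harmful concept. Per the
    # summary, no malicious data is needed: only harmful-concept conditioning,
    # whose representation is perturbed with Gaussian noise.
    h = model.encode(harm_cond)
    h_noised = h + sigma * torch.randn_like(h)   # representation-space noise injection
    loss_harm = model.diffusion_loss(safe_x, h_noised)

    # Single-level relaxation of the bilevel problem: minimize the safe loss
    # while maximizing the loss under (noised) harmful conditioning.
    loss = loss_safe - lam * loss_harm
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss_safe.item(), loss_harm.item()
```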
📝 Abstract
We present GIFT: a Gradient-aware Immunization technique to defend diffusion models against malicious Fine-Tuning while preserving their ability to generate safe content. Existing safety mechanisms, such as safety checkers, are easily bypassed, and concept-erasure methods fail under adversarial fine-tuning. GIFT addresses this by framing immunization as a bilevel optimization problem: the upper-level objective degrades the model's ability to represent harmful concepts through representation noising and loss maximization, while the lower-level objective preserves performance on safe data, yielding robust resistance to malicious fine-tuning without sacrificing safe generative quality. Experimental results show that our method significantly impairs the model's ability to relearn harmful concepts while maintaining performance on safe content, offering a promising direction for creating inherently safer generative models that resist adversarial fine-tuning attacks.
**Warning: This paper contains NSFW content. Reader discretion is advised.**
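One schematic way to write the bilevel objective described above (our notation, not necessarily the paper's): let $\mathcal{L}_{\mathrm{DM}}$ denote the standard denoising loss, $\mathcal{D}_s$ the safe training data, $c_h$ harmful-concept conditioning, and $\mathcal{L}_{\mathrm{RN}}$ a representation-noising term that pushes harmful-concept activations toward noise:

$$
\min_{\theta}\;\; \lambda\,\mathcal{L}_{\mathrm{RN}}(\theta;\, c_h) \;-\; \mathcal{L}_{\mathrm{DM}}(\theta;\, c_h)
\quad\text{s.t.}\quad
\theta \in \arg\min_{\theta'}\; \mathcal{L}_{\mathrm{DM}}(\theta';\, \mathcal{D}_s).
$$

In practice, bilevel problems of this shape are typically relaxed into a single weighted objective, as in the sketch under the AI summary above.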