🤖 AI Summary
Diffusion models are vulnerable to adversarial fine-tuning, in which malicious actors inject harmful concepts into the model without altering its architecture. Method: This paper proposes a gradient-aware immunization mechanism that formulates model immunity as a bilevel optimization problem: the upper level degrades the learnability of harmful concepts via representation-space noise injection and gradient perturbation, while the lower level preserves fidelity in safe content generation, all without access to malicious data and relying solely on the original training objective. Contribution/Results: Experiments demonstrate that the method significantly suppresses the model's capacity to relearn harmful content (average reduction of 87.3%) while maintaining competitive generation quality on safe content, achieving FID scores comparable to, or slightly better than, baselines (at most a 0.8-point increase). The approach substantially enhances robustness against adversarial fine-tuning, establishing a novel, data-free defense paradigm for diffusion models.
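To make the mechanism concrete, here is a minimal PyTorch sketch of one immunization step, assuming a hypothetical model interface (`model.encode(cond)` returning a conditioning embedding and `model.diffusion_loss(x, cond)` returning the standard denoising loss) and illustrative hyperparameters; it collapses the bilevel problem into a single weighted update, which the paper may solve differently.

```python
import torch

def gift_style_step(model, opt, safe_x, safe_cond, harm_cond, lam=0.1, sigma=0.05):
    """One sketched immunization step (interface and hyperparameters are
    illustrative assumptions, not the paper's actual implementation)."""
    # Lower level: preserve fidelity on safe content (original training objective).
    loss_safe = model.diffusion_loss(safe_x, model.encode(safe_cond))

    # Upper level: degrade learnability of the harmful concept. Per the
    # summary, no malicious data is needed: only harmful-concept conditioning,
    # whose representation is perturbed with Gaussian noise.
    h = model.encode(harm_cond)
    h_noised = h + sigma * torch.randn_like(h)   # representation-space noise injection
    loss_harm = model.diffusion_loss(safe_x, h_noised)

    # Single-level relaxation of the bilevel problem: minimize the safe loss
    # while maximizing the loss under (noised) harmful conditioning.
    loss = loss_safe - lam * loss_harm
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss_safe.item(), loss_harm.item()
```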
📝 Abstract
We present GIFT: a Gradient-aware Immunization technique to defend diffusion models against malicious Fine-Tuning while preserving their ability to generate safe content. Existing safety mechanisms, such as safety checkers, are easily bypassed, and concept-erasure methods fail under adversarial fine-tuning. GIFT addresses this by framing immunization as a bilevel optimization problem: the upper-level objective degrades the model's ability to represent harmful concepts through representation noising and loss maximization, while the lower-level objective preserves performance on safe data, yielding robust resistance to malicious fine-tuning without sacrificing safe generative quality. Experimental results show that our method significantly impairs the model's ability to relearn harmful concepts while maintaining performance on safe content, offering a promising direction for creating inherently safer generative models that resist adversarial fine-tuning attacks.
**Warning: This paper contains NSFW content. Reader discretion is advised.**
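One schematic way to write the bilevel objective described above (our notation, not necessarily the paper's): let $\mathcal{L}_{\mathrm{DM}}$ denote the standard denoising loss, $\mathcal{D}_s$ the safe training data, $c_h$ harmful-concept conditioning, and $\mathcal{L}_{\mathrm{RN}}$ a representation-noising term that pushes harmful-concept activations toward noise:

$$
\min_{\theta}\;\; \lambda\,\mathcal{L}_{\mathrm{RN}}(\theta;\, c_h) \;-\; \mathcal{L}_{\mathrm{DM}}(\theta;\, c_h)
\quad\text{s.t.}\quad
\theta \in \arg\min_{\theta'}\; \mathcal{L}_{\mathrm{DM}}(\theta';\, \mathcal{D}_s).
$$

In practice, bilevel problems of this shape are typically relaxed into a single weighted objective, as in the sketch under the AI summary above.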