🤖 AI Summary
This work challenges the efficacy of existing forgetting-based defense methods for diffusion models, which claim to remove unsafe content but may merely suppress—rather than erase—underlying knowledge, leaving it dormant. To expose this vulnerability, the authors propose the IVO attack framework, which reconstructs corrupted language-to-knowledge mappings by optimizing initial latent variables to reactivate dormant NSFW concepts. IVO represents the first systematic demonstration that current forgetting mechanisms in diffusion models suffer from a fundamental illusion of safety. By integrating image inversion with adversarial optimization, IVO aligns the noise distributions of original and unlearned models, establishing a general-purpose attack paradigm. Experiments across eight state-of-the-art forgetting methods show that IVO significantly improves attack success rates while preserving high semantic fidelity, thereby revealing critical security flaws in contemporary defenses.
📝 Abstract
Although unlearning-based defenses claim to purge Not-Safe-For-Work (NSFW) concepts from diffusion models (DMs), we reveal that this "forgetting" is largely an illusion. Unlearning only partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memory. We find that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of this mapping is retained, and thus of the strength of unlearning. Motivated by this, we propose IVO (Initial Latent Variable Optimization), a concise yet powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings. Through Image Inversion, Adversarial Optimization, and Reused Attack, IVO optimizes initial latent variables to realign the noise distribution of unlearned models with their original unsafe states. Extensive experiments across eight widely used unlearning techniques demonstrate that IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses. The code is available at anonymous.4open.science/r/IVO/. Warning: This paper contains unsafe images that may offend some readers.
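The core optimization described above can be sketched in miniature. The toy below is **not** the paper's implementation: the two "models" are hypothetical linear maps standing in for the original and unlearned noise predictors, and plain gradient descent on the initial latent stands in for backpropagating through a diffusion model's denoising step. It only illustrates the idea of driving the unlearned model's noise prediction back toward the original model's unsafe prediction.

```python
import numpy as np

# Toy illustration (assumed setup, not IVO itself): A and B are stand-ins
# for the original and unlearned noise predictors eps_theta. We optimize an
# initial latent z so that the unlearned model's prediction B @ z matches
# the original model's prediction for an unsafe latent.

rng = np.random.default_rng(0)
d = 16
A = rng.normal(size=(d, d))   # stand-in for the original model's predictor
B = rng.normal(size=(d, d))   # stand-in for the unlearned model's predictor

z_unsafe = rng.normal(size=d)      # latent yielding unsafe content originally
eps_target = A @ z_unsafe          # original model's noise prediction (target)

def loss_and_grad(z):
    """Squared distance between the unlearned prediction and the target,
    plus its analytic gradient with respect to the latent z."""
    r = B @ z - eps_target
    return float(r @ r), 2.0 * (B.T @ r)

z = rng.normal(size=d)             # start from a fresh Gaussian latent
loss0, _ = loss_and_grad(z)
lr = 1e-3
for _ in range(500):               # gradient descent on the initial latent
    loss, g = loss_and_grad(z)
    z -= lr * g

print(f"initial loss {loss0:.3f} -> final loss {loss:.3f}")
```

In the real setting, the analytic gradient would be replaced by automatic differentiation through the denoising network, and the target distribution would come from inverting images generated by the original model.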