🤖 AI Summary
This work addresses the challenge of erasing undesirable concepts from flow-matching-based text-to-image and text-to-video diffusion models built on transformer architectures, without compromising generation quality. The authors formulate concept erasure as a constrained multi-objective optimization problem that balances removal efficacy against generative fidelity. They propose a forgetting strategy that combines implicit gradient surgery, LoRA-based efficient fine-tuning, and attention regularization, complemented by an anchor-and-propagate mechanism that carries erasure effects consistently across spatial and temporal dimensions. As the first unified framework supporting both image and video diffusion models, the method achieves state-of-the-art performance across multiple benchmarks, significantly outperforming existing approaches in erasure effectiveness, generation fidelity, and temporal consistency.
📝 Abstract
Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
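The abstract does not spell out how the implicit gradient surgery resolves the conflict between the erasure and utility-preservation objectives. One common explicit variant of gradient surgery (PCGrad-style projection) can be sketched as follows; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def gradient_surgery(g_erase, g_preserve):
    """Sketch of explicit gradient surgery (PCGrad-style projection).

    When the erasure gradient conflicts with the preservation gradient
    (negative inner product), the conflicting component is projected out
    of the erasure gradient, so the update no longer harms generative
    utility along that direction. Non-conflicting gradients pass through
    unchanged.
    """
    dot = float(np.dot(g_erase, g_preserve))
    if dot < 0.0:
        # Remove the component of g_erase that opposes g_preserve.
        g_erase = g_erase - (dot / np.dot(g_preserve, g_preserve)) * g_preserve
    return g_erase

# Conflicting case: the returned gradient is orthogonal to g_preserve.
g_e = np.array([1.0, 0.0])
g_p = np.array([-1.0, 1.0])
g_out = gradient_surgery(g_e, g_p)
```

After projection, the surviving erasure gradient is orthogonal to the preservation gradient, so a step along it does not (to first order) increase the preservation loss. The paper's "implicit" formulation presumably achieves a comparable balance without forming both gradients explicitly.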