You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models are prone to reproducing training data during generation, posing significant privacy and copyright risks. To address this risk, the paper proposes GUARD, a framework that mitigates memorization at inference time by dynamically steering the denoising process away from training samples while preserving semantic fidelity to the prompt. The core innovation is a surgical intervention based on attention attenuation: a statistical method automatically identifies the prompt tokens that require mitigation, and their corresponding cross-attention weights are precisely attenuated, enabling fine-grained, dynamic suppression of memorization. Experiments show that GUARD achieves state-of-the-art performance in mitigating both verbatim and template memorization across two mainstream diffusion architectures, while maintaining or even improving image generation quality.
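The "steering away from training samples" idea can be pictured as adjusting the model's per-step noise prediction with an attractive term (toward a safe target) and a repulsive term (away from a memorized one). The sketch below is purely illustrative: the function name, weights `w_a`/`w_r`, and the linear combination are assumptions for exposition, not the paper's exact update rule.

```python
import numpy as np

def guarded_step(eps_cond, eps_attract, eps_repel, w_a=1.0, w_r=1.0):
    """Illustrative attractive-repulsive adjustment of a noise prediction.

    eps_cond:    the model's conditional noise prediction at this step
    eps_attract: prediction for a 'safe' positive target to move toward
    eps_repel:   prediction associated with the memorized sample to avoid

    Returns eps_cond shifted toward the attractor and away from the
    repeller. All names and weights are hypothetical placeholders.
    """
    return eps_cond + w_a * (eps_attract - eps_cond) - w_r * (eps_repel - eps_cond)
```

In practice such an adjustment would be applied inside the sampler at every denoising step, with the steering strength chosen to avoid degrading prompt alignment.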

📝 Abstract
Generative models have been shown to "memorize" certain training data, leading to verbatim or near-verbatim reproductions of training images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide generation away from the original training image and toward one that is distinct from the training data while remaining aligned with the prompt, guarding against reproduction of training data without hurting image generation quality. We propose a concrete instantiation of this framework in which the positive target we steer toward is produced by a novel method for cross-attention attenuation, based on (i) a statistical mechanism that automatically identifies the prompt positions where cross-attention must be attenuated and (ii) attenuation of cross-attention at these per-prompt locations. The resulting GUARD is a surgical, dynamic, per-prompt, inference-time approach that we find to be by far the most robust method: it consistently produces state-of-the-art memorization-mitigation results across two architectures, for both verbatim and template memorization, while improving upon or matching prior work in image quality.
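The attenuation step (ii) can be sketched concretely: once the statistical test of step (i) has flagged a set of prompt token positions, their columns in the cross-attention map are down-scaled and each row is renormalized so the weights still form a distribution. This is a minimal sketch under assumed shapes and a hypothetical `scale` parameter, not the paper's implementation.

```python
import numpy as np

def attenuate_cross_attention(attn, token_ids, scale=0.1):
    """Down-weight cross-attention to flagged prompt tokens.

    attn:      array of shape (heads, query_len, seq_len) holding
               post-softmax cross-attention weights (rows sum to 1)
    token_ids: prompt positions flagged for mitigation (assumed to come
               from a separate statistical identification step)
    scale:     multiplicative attenuation factor (hypothetical)

    Returns a new attention map with the flagged columns scaled down
    and each row renormalized to sum to 1.
    """
    attn = attn.copy()                      # leave the input untouched
    attn[:, :, token_ids] *= scale          # attenuate flagged tokens
    attn /= attn.sum(axis=-1, keepdims=True)  # renormalize each row
    return attn
```

Because the intervention touches only the identified positions, attention to the remaining prompt tokens is preserved (and slightly boosted by renormalization), which is what makes the mitigation "surgical".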
Problem

Research questions and friction points this paper is trying to address.

memorization
text-to-image diffusion models
privacy
copyright infringement
training data leakage
Innovation

Methods, ideas, or system contributions that make the work stand out.

memorization mitigation
diffusion models
cross-attention attenuation
inference-time intervention
privacy-preserving generation
🔎 Similar Papers
2024-04-19 · IEEE Journal of Biomedical and Health Informatics · Citations: 1