Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion

📅 2026-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of text-to-image diffusion models to memorize training data, which signals poor generalization and raises safety concerns. Existing mitigation strategies often reduce memorization at the cost of image quality or prompt alignment. The authors instead treat the diffusion denoising process as a dynamical system and combine reachability analysis with constrained reinforcement learning. Reachability analysis identifies intermediate states that are likely to evolve into memorized samples, and a constrained reinforcement learning policy applies minimal perturbations in the caption embedding space to steer the generation trajectory away from these memorization-prone regions. The approach works at inference time and does not modify the backbone model, so it can be deployed in a plug-and-play fashion. The authors report that it preserves image fidelity (FID) and prompt alignment (CLIP score) while improving output diversity (SSCD), giving a better Pareto frontier than existing methods.
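
One plausible way to write down the setup described above is given here as a sketch. All notation is assumed for illustration and is not taken from the paper: x_t is the latent at denoising step t, c the caption embedding, f_θ one step of the frozen denoiser, M the set of memorized samples, π the steering policy, δ_t its perturbation, and ε a tolerance on the memorization probability. The paper's actual objective and constraint may differ.

```latex
% Assumed notation, for illustration only (requires amsmath).
\begin{align}
  % Denoising as a controlled dynamical system; the control acts on the embedding.
  x_{t-1} &= f_\theta\bigl(x_t,\; c + \delta_t,\; t\bigr),
  \qquad \delta_t = \pi(x_t, c, t), \\
  % Backward reachable tube: states from which the unperturbed dynamics
  % end in the memorized set.
  \mathcal{B} &= \bigl\{ (x, t) \;:\; x_t = x,\ \delta_s = 0 \ \ \forall s \le t
      \;\Longrightarrow\; x_0 \in \mathcal{M} \bigr\}, \\
  % Constrained problem: smallest total perturbation that keeps the
  % probability of ending in the memorized set below a tolerance.
  \min_{\pi}\quad & \mathbb{E}\Bigl[\, \sum_{t=1}^{T} \lVert \delta_t \rVert_2^2 \Bigr]
  \quad \text{s.t.} \quad \Pr\bigl[x_0 \in \mathcal{M}\bigr] \le \epsilon .
\end{align}
```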

📝 Abstract

Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"--the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: https://s-karnik.github.io/rads-memorization-project-page/.
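
To make the steering idea concrete, here is a minimal toy sketch in Python. It is not the RADS implementation. The two-dimensional "denoiser", the constants, and every function name are hypothetical, and the learned RL policy is replaced by a plain grid search over perturbation magnitudes. Membership in the backward reachable tube is approximated by simply rolling the unperturbed dynamics forward and checking whether the result lands in the memorized set.

```python
import numpy as np

# All values below are hypothetical and chosen only for illustration.
MEMORIZED = np.array([1.0, 0.0])  # stand-in for a memorized training sample
RADIUS = 0.2                      # distance below which an output counts as memorized
STEPS = 20                        # number of denoising steps
ALPHA = 0.3                       # contraction rate of the toy denoiser


def denoise_step(x, c):
    """Toy stand-in for one denoising step: pull the state toward embedding c."""
    return x + ALPHA * (c - x)


def rollout(x, c, steps):
    """Apply the toy denoiser for `steps` steps with a fixed embedding."""
    for _ in range(steps):
        x = denoise_step(x, c)
    return x


def in_reachable_tube(x, c, steps_left):
    """Simulation proxy for tube membership: does the rollout end memorized?"""
    final = rollout(x, c, steps_left)
    return np.linalg.norm(final - MEMORIZED) < RADIUS


def minimal_perturbation(x, c, steps_left, direction, max_norm=1.0, n_grid=101):
    """Grid search (not RL) for the smallest shift of c along `direction`
    whose rollout, with the shift held fixed, leaves the memorized set."""
    direction = direction / np.linalg.norm(direction)
    for mag in np.linspace(0.0, max_norm, n_grid):
        delta = mag * direction
        if not in_reachable_tube(x, c + delta, steps_left):
            return delta
    return None  # no escaping perturbation within the budget


def steer(x_init, c, direction):
    """Denoise step by step, perturbing c only while the state is in the tube."""
    x = x_init
    total_cost = 0.0
    for steps_left in range(STEPS, 0, -1):
        delta = np.zeros_like(c)
        if in_reachable_tube(x, c, steps_left):
            found = minimal_perturbation(x, c, steps_left, direction)
            if found is not None:
                delta = found
        total_cost += float(np.linalg.norm(delta))
        x = denoise_step(x, c + delta)  # the backbone step itself is unchanged
    return x, total_cost


if __name__ == "__main__":
    x0 = np.zeros(2)
    caption = MEMORIZED.copy()  # a caption whose embedding pulls toward the memorized sample
    unsteered = rollout(x0, caption, STEPS)
    steered, cost = steer(x0, caption, np.array([0.0, 1.0]))
    print("unsteered distance:", np.linalg.norm(unsteered - MEMORIZED))
    print("steered distance:  ", np.linalg.norm(steered - MEMORIZED))
    print("total perturbation:", cost)
```

The sketch keeps two properties stressed in the abstract: the denoising step is never modified, and intervention happens only through the embedding and only when the current state would otherwise end up memorized. A learned policy would replace the grid search, and a learned or approximated reachability estimate would replace the brute-force rollout.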

Problem

Research questions and friction points this paper is trying to address.

memorization

text-to-image diffusion

generalization

image generation

training data

Innovation

Methods, ideas, or system contributions that make the work stand out.

reachability analysis

constrained reinforcement learning

diffusion models