SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

📅 2026-01-13
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work proposes SafeRedir, a lightweight inference-time framework that mitigates the risk of image generation models reproducing unsafe content—such as NSFW imagery or copyrighted artistic styles—without requiring model retraining. SafeRedir introduces the first token-level semantic redirection mechanism, leveraging a latent-aware multimodal safety classifier to detect risky prompts, and a token-level delta generator with token masking and adaptive scaling that precisely steers unsafe semantics toward safe regions in the embedding space. Experiments across multiple diffusion models demonstrate that SafeRedir effectively suppresses unsafe outputs while preserving high semantic fidelity, image quality, and robustness against prompt rephrasing and adversarial attacks. The method is plug-and-play, highly generalizable, and incurs minimal degradation in generation performance.

📝 Abstract
Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, but they require costly retraining, degrade the quality of benign generations, or fail to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.
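The abstract's pipeline — classify the prompt, mask the risky tokens, generate a per-token delta, and scale the intervention — can be sketched as a minimal inference-time hook. This is an illustrative reconstruction, not the authors' implementation: the four module names (`safety_classifier`, `mask_predictor`, `delta_generator`, `scale_predictor`) and the additive update rule are hypothetical stand-ins for SafeRedir's components.

```python
import numpy as np

def redirect_prompt_embedding(emb, safety_classifier, mask_predictor,
                              delta_generator, scale_predictor,
                              threshold=0.5):
    """Token-level prompt-embedding redirection (illustrative sketch).

    emb: (seq_len, dim) array of prompt token embeddings from the text
    encoder. All four callables are hypothetical stand-ins for the
    paper's learned modules.
    """
    # 1. Latent-aware safety classifier scores the generation trajectory.
    risk = safety_classifier(emb)            # scalar in [0, 1]
    if risk < threshold:
        return emb                           # benign prompt: no intervention

    # 2. Token-masking predictor localizes which tokens carry unsafe semantics.
    mask = mask_predictor(emb)               # (seq_len, 1), values in [0, 1]

    # 3. Delta generator proposes a per-token shift toward a safe region.
    delta = delta_generator(emb)             # (seq_len, dim)

    # 4. Adaptive scaling regulates the intervention strength per token.
    alpha = scale_predictor(emb)             # (seq_len, 1)

    # 5. Apply the localized, scaled redirection in embedding space;
    #    unmasked tokens pass through unchanged, preserving benign semantics.
    return emb + mask * alpha * delta
```

Because the update is additive and masked, tokens judged safe are left byte-identical, which is what lets the method preserve semantic fidelity on benign prompts while remaining plug-and-play across diffusion backbones.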
Problem

Research questions and friction points this paper is trying to address.

image generation models
unsafe content
unlearning
safety
adversarial robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt embedding redirection
inference-time unlearning
token-level intervention
robust safety control
diffusion model compatibility