Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions

📅 2025-04-22

🏛️ The Web Conference

📈 Citations: 0

✨ Influential: 0

career value

215K/year

🤖 AI Summary

Diffusion models generate high-fidelity images but often produce NSFW content and reinforce societal biases, hindering real-world deployment. To address this, we propose a safety-constrained framework operating directly in the text embedding space—without altering the original prompt. Our method introduces a novel self-discovered semantic direction vector mechanism that geometrically steers text embeddings toward predefined safe regions. By initializing direction vectors via LoRA and jointly fine-tuning for safety, we achieve low-intrusiveness. Evaluated across multiple benchmarks, our approach significantly reduces NSFW generation (average ↓62.3%) and bias metrics (e.g., Stereoset ↓41.7%), while preserving image fidelity (FID change < 0.8). It outperforms existing mainstream safety mitigation methods in both effectiveness and preservation of generation quality.

Technology Category

Application Category

📝 Abstract

The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines. WARNING:This paper contains model-generated images that may be potentially offensive.

Problem

Research questions and friction points this paper is trying to address.

Preventing NSFW content in diffusion model outputs

Reducing social biases in generated images

Maintaining model performance while enhancing safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-discovery of semantic direction vector

Low-Rank Adaptation for vector initialization

Integration with existing responsibility methods

🔎 Similar Papers

SteerDiff: Steering towards Safe Text-to-Image Diffusion Models