Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions

πŸ“… 2025-04-22
πŸ›οΈ The Web Conference
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Diffusion models generate high-fidelity images but often produce NSFW content and reinforce societal biases, hindering real-world deployment. To address this, we propose a safety-constrained framework operating directly in the text embedding spaceβ€”without altering the original prompt. Our method introduces a novel self-discovered semantic direction vector mechanism that geometrically steers text embeddings toward predefined safe regions. By initializing direction vectors via LoRA and jointly fine-tuning for safety, we achieve low-intrusiveness. Evaluated across multiple benchmarks, our approach significantly reduces NSFW generation (average ↓62.3%) and bias metrics (e.g., Stereoset ↓41.7%), while preserving image fidelity (FID change < 0.8). It outperforms existing mainstream safety mitigation methods in both effectiveness and preservation of generation quality.

Technology Category

Application Category

πŸ“ Abstract
The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines. WARNING:This paper contains model-generated images that may be potentially offensive.
Problem

Research questions and friction points this paper is trying to address.

Preventing NSFW content in diffusion model outputs
Reducing social biases in generated images
Maintaining model performance while enhancing safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-discovery of semantic direction vector
Low-Rank Adaptation for vector initialization
Integration with existing responsibility methods
πŸ”Ž Similar Papers
No similar papers found.
Zhiwen Li
Zhiwen Li
NIAID
Bioinformatics
D
Die Chen
East China Normal University
Mingyuan Fan
Mingyuan Fan
Kunlun Inc
AIGC Semantic Segmentation
C
Cen Chen
East China Normal University
Yaliang Li
Yaliang Li
Alibaba Group
Machine Learning
Y
Yanhao Wang
East China Normal University
W
Wenmeng Zhou
Alibaba Group