🤖 AI Summary
Text-to-image diffusion models can generate unsafe images under malicious prompts because their large-scale web-crawled training data contains inappropriate or biased content. To address this, the authors propose Safe Text embedding Guidance (STG), a training-free method: during diffusion sampling, a safety function is evaluated on the expected final denoised image, and the text embeddings are adjusted accordingly, steering generation away from unsafe content (such as nudity, violence, and specific artist styles) while preserving the prompt's semantics. Theoretically, STG aligns the underlying model distribution with the safety constraints, so safer outputs are achieved with minimal impact on generation quality. Experiments across diverse sensitive scenarios show that STG consistently outperforms both training-based and training-free baselines, suppressing harmful content while preserving the semantic intent of the input prompt.
📝 Abstract
Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at https://github.com/aailab-kaist/STG.
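The guidance idea described above can be sketched in a toy form: estimate the expected final denoised image from the current noisy sample (via Tweedie's formula), score it with a safety function, and nudge the text embedding along the gradient of that score. The sketch below uses only illustrative stand-ins (the linear `denoiser`, the dot-product `safety` score, the finite-difference gradient, and all parameter names are assumptions, not the paper's actual models or API):

```python
import math

def denoiser(x_t, emb, t):
    # Toy noise predictor eps_theta(x_t, emb, t); a real model would be a
    # text-conditioned U-Net. Here it depends linearly on the embedding.
    return [0.1 * xi + ei for xi, ei in zip(x_t, emb)]

def tweedie_x0(x_t, emb, t, alpha_bar):
    # Expected final denoised image E[x_0 | x_t] via Tweedie's formula:
    #   x0_hat = (x_t - sqrt(1 - alpha_bar) * eps_theta) / sqrt(alpha_bar)
    eps = denoiser(x_t, emb, t)
    return [(xi - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for xi, e in zip(x_t, eps)]

def safety(x0_hat, unsafe_dir):
    # Toy safety score: higher when x0_hat points away from an "unsafe"
    # direction (a stand-in for, e.g., a nudity/violence detector).
    return -sum(a * b for a, b in zip(x0_hat, unsafe_dir))

def stg_step(x_t, emb, t, alpha_bar, unsafe_dir, lr=0.1, h=1e-4):
    # One guidance step: ascend the (finite-difference) gradient of the
    # safety score, evaluated on the Tweedie estimate, w.r.t. the embedding.
    base = safety(tweedie_x0(x_t, emb, t, alpha_bar), unsafe_dir)
    grad = []
    for i in range(len(emb)):
        bumped = list(emb)
        bumped[i] += h
        s = safety(tweedie_x0(x_t, bumped, t, alpha_bar), unsafe_dir)
        grad.append((s - base) / h)
    return [e + lr * g for e, g in zip(emb, grad)]

# Usage: one step should increase the safety score of the predicted x_0.
x_t, emb = [0.2, -0.3], [1.0, 0.5]
alpha_bar, unsafe_dir = 0.5, [1.0, 0.0]
before = safety(tweedie_x0(x_t, emb, 0, alpha_bar), unsafe_dir)
emb = stg_step(x_t, emb, 0, alpha_bar, unsafe_dir)
after = safety(tweedie_x0(x_t, emb, 0, alpha_bar), unsafe_dir)
```

In the actual method this adjustment would run inside the sampling loop at each denoising step, with the safety function's gradient obtained by backpropagation rather than finite differences; the sketch only illustrates the shape of the update.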