SafeText: Safe Text-to-image Models via Aligning the Text Encoder

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image models often generate unsafe content when prompted with harmful inputs, while existing alignment methods—primarily modifying the diffusion module—frequently degrade generation quality under benign prompts. This work proposes the first safety alignment framework focused exclusively on fine-tuning the text encoder. It distinguishes safe from harmful prompts via semantic embedding perturbation, enabling targeted embedding shifts and multi-stage training without compromising the diffusion module's generalization capability. The approach achieves fine-grained safety control while preserving fidelity. Evaluated across multiple benchmarks, it outperforms six state-of-the-art alignment methods: harmful image generation drops by over 92%, and FID under safe prompts increases by only 0.8, demonstrating an exceptional trade-off between safety and generation quality. Code and data will be publicly released.

📝 Abstract
Text-to-image models can generate harmful images when presented with unsafe prompts, posing significant safety and societal risks. Alignment methods aim to modify these models to ensure they generate only non-harmful images, even when exposed to unsafe prompts. A typical text-to-image model comprises two main components: 1) a text encoder and 2) a diffusion module. Existing alignment methods mainly focus on modifying the diffusion module to prevent harmful image generation. However, this often significantly impacts the model's behavior for safe prompts, causing substantial quality degradation of generated images. In this work, we propose SafeText, a novel alignment method that fine-tunes the text encoder rather than the diffusion module. By adjusting the text encoder, SafeText significantly alters the embedding vectors for unsafe prompts, while minimally affecting those for safe prompts. As a result, the diffusion module generates non-harmful images for unsafe prompts while preserving the quality of images for safe prompts. We evaluate SafeText on multiple datasets of safe and unsafe prompts, including those generated through jailbreak attacks. Our results show that SafeText effectively prevents harmful image generation with minor impact on the images for safe prompts, and SafeText outperforms six existing alignment methods. We will publish our code and data after paper acceptance.
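The abstract's core idea—shift the text encoder's embeddings for unsafe prompts while keeping embeddings for safe prompts close to their originals—can be illustrated with a toy objective. This is a minimal sketch under assumptions of my own (the hinge/preservation formulation, `margin`, and `lam` are hypothetical), not the paper's actual loss, which is not given here:

```python
import numpy as np

def embedding_alignment_loss(unsafe_new, unsafe_orig, safe_new, safe_orig,
                             margin=1.0, lam=1.0):
    """Toy SafeText-style objective (illustrative, not the paper's loss).

    unsafe_new/unsafe_orig: embeddings of unsafe prompts from the
        fine-tuned vs. frozen original text encoder, shape (N, D).
    safe_new/safe_orig: same for safe prompts, shape (M, D).
    """
    # How far unsafe embeddings have moved from their original positions.
    d_unsafe = np.linalg.norm(unsafe_new - unsafe_orig, axis=-1)
    # How far safe embeddings have drifted (we want this near zero).
    d_safe = np.linalg.norm(safe_new - safe_orig, axis=-1)

    # Hinge term: penalize unsafe embeddings that moved less than `margin`.
    push_away = np.maximum(0.0, margin - d_unsafe).mean()
    # Preservation term: quadratic penalty on any drift of safe embeddings.
    keep_close = (d_safe ** 2).mean()
    return push_away + lam * keep_close

# Example: unsafe embeddings pushed well past the margin, safe ones untouched
# -> both terms vanish and the loss is zero.
rng = np.random.default_rng(0)
unsafe_orig = rng.standard_normal((4, 8))
safe_orig = rng.standard_normal((4, 8))
loss = embedding_alignment_loss(unsafe_orig + 2.0, unsafe_orig,
                                safe_orig.copy(), safe_orig)
```

In this framing, minimizing the loss over the text encoder's parameters moves unsafe-prompt embeddings out of the region the diffusion module associates with harmful content, while the preservation term keeps safe-prompt embeddings (and hence image quality for benign inputs) largely intact.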
Problem

Research questions and friction points this paper is trying to address.

Prevent harmful image generation from unsafe prompts
Minimize quality degradation for safe prompts
Align text encoder to improve model safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes text encoder for safe image generation
Minimally affects safe prompt image quality
Outperforms six existing alignment methods