🤖 AI Summary
Existing text-to-image diffusion models are vulnerable to misuse for generating harmful content. Conventional safety mechanisms, such as text classifiers or ControlNet-style interventions, rely heavily on large-scale annotated data, are susceptible to adversarial bypassing, and scale poorly as model architectures evolve. This paper proposes SteerDiff, a lightweight concept intervention module that sits between the user's text prompt and the diffusion model and dynamically identifies and manipulates unsafe semantic concepts in the text embedding space, without fine-tuning the backbone model or requiring extensive labeled data. Its key innovations include: (i) the first adapter design integrating textual embedding perturbation with directional semantic control; (ii) concept-decoupled representation learning; and (iii) red-teaming-driven robustness validation. Experiments demonstrate that SteerDiff significantly outperforms baselines across diverse harmful concept removal tasks, maintains high robustness against various red-team attacks, and generalizes effectively to controllable generation tasks such as concept forgetting.
📝 Abstract
Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and inflexible. Recent research on red-teaming attacks further underscores the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.
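To make the idea of steering in the text embedding space concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an unsafe concept can be approximated by a single direction vector in embedding space and that tokens aligned with that direction can be pushed away from it by subtracting their projection. The function name `steer_away` and the parameters `unsafe_direction`, `threshold`, and `strength` are hypothetical and chosen only for illustration; the abstract does not specify how SteerDiff identifies or manipulates concepts.

```python
import numpy as np


def steer_away(embeddings, unsafe_direction, threshold=0.0, strength=1.0):
    """Illustrative only: damp the component of token embeddings that lies
    along an assumed 'unsafe concept' direction.

    embeddings:       array of shape (num_tokens, dim)
    unsafe_direction: array of shape (dim,), assumed to be given
    threshold:        tokens whose projection exceeds this value are flagged
    strength:         1.0 removes the flagged component entirely
    """
    embeddings = np.asarray(embeddings, dtype=float)
    # Normalise so that projections are measured in embedding units.
    direction = np.asarray(unsafe_direction, dtype=float)
    direction = direction / np.linalg.norm(direction)

    # Signed projection of every token embedding onto the direction.
    scores = embeddings @ direction
    flagged = scores > threshold

    # Subtract the projected component, but only for flagged tokens.
    adjustment = np.where(flagged, scores, 0.0)[:, None] * direction
    return embeddings - strength * adjustment, flagged
```

In this sketch the unflagged tokens pass through unchanged, which mirrors the stated goal of having little to no impact on benign prompts. A real system would need a learned way to obtain the direction and the flagging rule.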