CGCE: Classifier-Guided Concept Erasure in Generative Models

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Erasing undesirable concepts from large generative models safely and robustly, without compromising model integrity or performance, remains challenging. Method: This paper proposes a plug-and-play framework that requires no weight modification. It constructs a lightweight classifier in the text-embedding space to detect harmful prompts in real time, and achieves precise suppression of sensitive concepts via dynamic prompt reconstruction and inference-time embedding correction. Crucially, it introduces a novel multi-classifier aggregation guidance mechanism that enables joint suppression of multiple undesirable concepts. Results: Extensive experiments on text-to-image (T2I) and text-to-video (T2V) models demonstrate that the method significantly enhances robustness against red-teaming attacks while preserving generation quality and diversity on benign prompts, effectively balancing safety and fidelity.
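The detect-then-correct idea in the summary can be sketched as below. Note this is a minimal illustration, not the paper's actual design: the classifier architecture (`ConceptClassifier`), the mean-pooling over tokens, and the gradient-descent correction with its step count and learning rate are all assumptions made here for concreteness.

```python
import torch
import torch.nn as nn

class ConceptClassifier(nn.Module):
    """Hypothetical lightweight classifier scoring whether a prompt's
    text embedding contains a target concept (architecture assumed)."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # one logit: "concept present" score
        )

    def forward(self, emb):
        # emb: (batch, seq_len, embed_dim); pool over tokens, then score
        return self.net(emb.mean(dim=1)).squeeze(-1)

def correct_embedding(emb, classifier, threshold=0.5, steps=10, lr=0.05):
    """Inference-time correction (illustrative): if the classifier flags
    the embedding, descend on the concept probability to suppress it.
    Assumes a batch of one prompt."""
    if torch.sigmoid(classifier(emb)).item() < threshold:
        return emb  # benign prompt: leave the embedding untouched
    emb = emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        loss = torch.sigmoid(classifier(emb)).sum()  # concept score to push down
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()
```

Only embeddings that the classifier flags are modified, which is how the approach leaves benign prompts, and hence the model's original generation quality, untouched.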

📝 Abstract
Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model's original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
Problem

Research questions and friction points this paper is trying to address.

Removing unsafe content from generative models while maintaining quality
Addressing vulnerability of concept erasure methods to adversarial attacks
Balancing robust safety measures with preservation of generative performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses lightweight classifier to detect unsafe text embeddings
Refines prompts by modifying only unsafe embeddings
Enables multi-concept erasure through classifier aggregation
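One plausible reading of the multi-concept aggregation is to combine suppression gradients from several per-concept classifiers into a single correction of the embedding. The uniform weighting and single-step update below are assumptions for illustration; the paper's exact aggregation scheme is not specified in this summary.

```python
import torch

def aggregate_guidance(emb, classifiers, weights=None, step=0.1):
    """Illustrative multi-classifier aggregation: sum weighted concept
    scores from each classifier and take one gradient step on the
    embedding to suppress all concepts jointly (scheme assumed)."""
    weights = weights or [1.0] * len(classifiers)
    emb = emb.clone().requires_grad_(True)
    total = sum(w * torch.sigmoid(clf(emb)).sum()
                for w, clf in zip(weights, classifiers))
    total.backward()
    # Move the embedding against the aggregated concept gradient
    return (emb - step * emb.grad).detach()
```

Because each classifier is trained independently for its own concept, adding a new concept to erase only requires plugging in one more classifier, which is what makes the approach scalable.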