SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing

📅 2025-06-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Pretrained diffusion models (DMs) pose safety and copyright risks because sensitive concepts are implicitly encoded in their representations; existing word-level erasure methods generalize poorly, remaining trapped in the "word concept abyss." This paper proposes SAGE, a semantic-augment erasing framework: first, a concept-domain boundary exploration paradigm lifts erasure from the discrete token space into a continuous semantic embedding space; second, a cyclic self-check and self-erasure mechanism, combined with a global-local collaborative retention mechanism, jointly enforces global semantic alignment and preserves local predicted noise, broadly suppressing unsafe concepts while maintaining high-fidelity generation for irrelevant ones. Across multiple benchmarks, SAGE achieves state-of-the-art performance in safety compliance, generation fidelity, and cross-concept generalization. Code and pretrained weights will be open-sourced.

πŸ“ Abstract
Diffusion models (DMs) have achieved significant progress in text-to-image generation. However, the inevitable inclusion of sensitive information during pre-training poses safety risks, such as unsafe content generation and copyright infringement. Concept erasing, which finetunes weights to unlearn undesirable concepts, has emerged as a promising solution. However, existing methods treat an unsafe concept as a fixed word and repeatedly erase it, trapping DMs in the "word concept abyss", which prevents generalized concept-related erasing. To escape this abyss, we introduce semantic-augment erasing, which transforms concept-word erasure into concept-domain erasure through cyclic self-check and self-erasure. It efficiently explores and unlearns the boundary representation of the concept domain through semantic spatial relationships between the original and training DMs, without requiring additional preprocessed data. Meanwhile, to mitigate the retention degradation of irrelevant concepts while erasing unsafe concepts, we further propose a global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation, effectively expanding the retentive receptive field for irrelevant concepts. We name our method SAGE, and extensive experiments demonstrate the comprehensive superiority of SAGE compared with other methods in the safe generation of DMs. The code and weights will be open-sourced at https://github.com/KevinLight831/SAGE.
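The erase-while-retain objective the abstract describes can be illustrated with a deliberately tiny sketch. Everything below is an illustrative assumption, not SAGE's actual training code: the elementwise "noise predictor" stands in for a diffusion U-Net, the zero anchor stands in for a neutral target concept, and the equal loss weighting is arbitrary.

```python
import random

random.seed(0)
DIM = 8

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy stand-ins: a frozen copy of the original DM and a trainable copy
# being erased. Shapes and the linear form are assumptions for clarity.
frozen_w = rand_vec()
train_w = list(frozen_w)  # starts identical to the original

def predict(w, concept):
    # stand-in for the conditional noise prediction eps(x_t, c)
    return [wi * ci for wi, ci in zip(w, concept)]

unsafe = rand_vec()      # embedding of the unsafe concept (toy)
anchor = [0.0] * DIM     # neutral target the unsafe concept is mapped to
irrelevant = rand_vec()  # unrelated concept whose behavior must be kept

# Erasure term: the trainable model's prediction for the unsafe concept
# should match the frozen model's prediction for the neutral anchor.
erase_loss = mse(predict(train_w, unsafe), predict(frozen_w, anchor))

# Retention term (the "local predicted noise preservation" idea): on
# irrelevant concepts, the trainable model must not drift from the original.
retain_loss = mse(predict(train_w, irrelevant), predict(frozen_w, irrelevant))

total_loss = erase_loss + retain_loss  # equal weighting is an assumption
```

Since the trainable copy starts identical to the frozen model, the retention term is zero before any update and only grows as erasure pulls the weights away, which is exactly the tension the global-local retention mechanism manages.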
Problem

Research questions and friction points this paper is trying to address.

Eliminating unsafe content in diffusion models
Escaping the word concept abyss issue
Preserving irrelevant concepts during erasure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-augment erasing transforms word to domain erasure
Global-local retention mechanism preserves irrelevant concepts
Cyclic self-check explores boundary of unsafe concepts
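The boundary-exploration idea in the bullets above can be sketched as sampling a semantic neighborhood around the word embedding rather than erasing the single token itself. The cosine threshold and noise scale below are invented for illustration; in the paper, the boundary is explored via cyclic self-check between the original and training DMs, not a fixed similarity cutoff.

```python
import math
import random

random.seed(0)
DIM = 16

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

concept = rand_vec()   # toy embedding of the unsafe concept word
THRESHOLD = 0.7        # assumed membership criterion for the concept domain

# Explore the concept *domain*: perturb the word embedding and keep the
# samples that still lie inside the domain (high similarity to the word),
# so erasure covers a semantic neighborhood instead of one fixed token.
samples = [[c + random.gauss(0.0, 1.0) for c in concept] for _ in range(200)]
inside = [s for s in samples if cosine(s, concept) >= THRESHOLD]
```

Erasing every member of `inside` (rather than only `concept`) is the word-to-domain shift the first bullet refers to; samples just below the threshold approximate the domain boundary the self-check step probes.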
Hongguang Zhu
Faculty of Data Science, City University of Macau
Yunchao Wei
Professor, Beijing Jiaotong University, UTS, UIUC, NUS
Computer Vision, Machine Learning
Mengyu Wang
Institute of Information Science, Beijing Jiaotong University, and Beijing Key Laboratory of Advanced Information Science and Network Technology
Siyu Jiao
Beijing Jiaotong University
Vision & Language, Segmentation
Yan Fang
Institute of Information Science, Beijing Jiaotong University, and Beijing Key Laboratory of Advanced Information Science and Network Technology
Jiannan Huang
Institute of Information Science, Beijing Jiaotong University, and Beijing Key Laboratory of Advanced Information Science and Network Technology
Yao Zhao
Institute of Information Science, Beijing Jiaotong University, and Beijing Key Laboratory of Advanced Information Science and Network Technology