Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing concept erasure methods for diffusion models suppress target concepts rather than fully eliminating them, which leaves erased concepts open to reactivation under black-box conditions and poses significant security risks. Viewing generation through the lens of denoising trajectories, this work shows that semantic information can still propagate after erasure. Building on that observation, the authors propose ConceptAgent, a training-free, black-box, multi-agent framework that reactivates erased concepts in a controllable way through surrogate-guided noise initialization, without access to model parameters, gradients, or internal representations. The experiments expose fundamental limitations of current erasure techniques and offer new insights into semantic control mechanisms in diffusion models.

📝 Abstract

Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black-box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.
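The abstract does not spell out how a "surrogate-guided noisy state" is built, so the following is only a minimal sketch of one standard ingredient it may rely on: the DDPM forward-noising formula, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The linear beta schedule, the timestep 600, the random array standing in for a surrogate image, and the function names `linear_alpha_bar` and `noisy_state` are all assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal coefficients for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noisy_state(x0, t, alpha_bar, rng):
    """Standard DDPM forward noising of x0 to timestep t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
# Hypothetical placeholder for a surrogate image scaled to [-1, 1].
surrogate = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
alpha_bar = linear_alpha_bar()
# An intermediate state: part surrogate signal, part Gaussian noise.
x_mid = noisy_state(surrogate, t=600, alpha_bar=alpha_bar, rng=rng)
```

Because `alpha_bar` decreases with `t`, a larger timestep keeps less of the surrogate and more noise. This matches the abstract's trajectory view only loosely: the state a sampler starts from carries semantic information independently of the text condition.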

Problem

Research questions and friction points this paper is trying to address.

concept erasure

diffusion models

black-box attack

concept awakening

text-to-image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

concept erasure

black-box attack

diffusion models