🤖 AI Summary
Existing text-to-image (T2I) models struggle to simultaneously erase multiple target concepts in complex scenes and lack support for compositional multi-concept erasure. This work formalizes this task for the first time and introduces Mosaic, a novel framework that leverages the spatial locality of concepts in diffusion-based T2I models. By dynamically generating concept-specific masks and selectively fusing them within the vector field, Mosaic enables simultaneous removal of multiple concepts without requiring additional optimization. The authors also construct CoME-Bench, a comprehensive benchmark encompassing both intra-class and cross-class scenarios. Experimental results demonstrate that the proposed method effectively removes multiple target concepts while significantly preserving non-target content, substantially outperforming existing single-concept erasure approaches.
📝 Abstract
Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.