🤖 AI Summary
Personalized text-to-image generation models often face a trade-off between concept fidelity and text alignment because irrelevant residual information from the reference images becomes entangled with the learned concept. To address this, this work proposes ConceptPrism, an unsupervised concept disentanglement method that requires no manual guidance such as linguistic prompts or segmentation masks. ConceptPrism automatically separates the visual concept shared across a set of reference images from image-specific residuals, and introduces a novel exclusion loss that drives the residual tokens to discard shared semantics, so that the concept token captures only the core content. Within a diffusion-based joint optimization framework, the reconstruction and exclusion losses are optimized together to separate conceptual and residual information implicitly. Experiments demonstrate that ConceptPrism effectively mitigates concept entanglement, achieving high fidelity while significantly improving text alignment.
📝 Abstract
Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured along with the target concept, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this using manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and often fails to fully articulate the target concept. In this paper, we propose ConceptPrism, a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. Our method jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels the residual tokens to discard the shared concept. This process allows the target token to capture the pure concept without direct supervision. Extensive experiments demonstrate that ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.
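The structure of the joint objective described above (one shared target token, one residual token per image, a reconstruction term plus an exclusion term) can be illustrated with a toy sketch. The abstract does not give the loss formulas, so everything concrete here is an assumption made for illustration: the additive feature model stands in for the diffusion denoising loss, the squared-cosine penalty stands in for the exclusion loss, and the names `reconstruction_loss`, `exclusion_loss`, `joint_loss`, and the weight `lam` are hypothetical. It should not be read as the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot(u, v) / (nu * nv)


def reconstruction_loss(target, residuals, images):
    """Toy stand-in for the diffusion reconstruction loss (assumption).

    Each image feature vector should be explained by the shared target
    token plus that image's own residual token.
    """
    total = 0.0
    for r, x in zip(residuals, images):
        pred = [t + ri for t, ri in zip(target, r)]
        total += sum((p - xi) ** 2 for p, xi in zip(pred, x)) / len(x)
    return total / len(images)


def exclusion_loss(target, residuals):
    """Hypothetical exclusion term (assumption, not the paper's formula).

    Penalizes overlap between each residual token and the shared target
    token, pushing residuals to drop the shared concept.
    """
    return sum(cosine(r, target) ** 2 for r in residuals) / len(residuals)


def joint_loss(target, residuals, images, lam=1.0):
    """Reconstruction plus weighted exclusion; lam is a hypothetical weight."""
    return (reconstruction_loss(target, residuals, images)
            + lam * exclusion_loss(target, residuals))


# Two toy "images": the first coordinate is the shared concept,
# the remaining coordinates are image-specific.
images = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

# Disentangled: the target holds the shared part, residuals hold the rest.
clean = joint_loss([1.0, 0.0, 0.0],
                   [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], images)

# Entangled: residuals absorb half of the shared concept.
entangled = joint_loss([0.5, 0.0, 0.0],
                       [[0.5, 1.0, 0.0], [0.5, 0.0, 1.0]], images)

print(clean, entangled)
```

In this toy setting both configurations reconstruct the images exactly, so the reconstruction term alone cannot tell them apart; only the exclusion term penalizes the entangled one. This mirrors the abstract's point that the two objectives are complementary.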