Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

📅 2024-03-25
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Existing text-to-image diffusion models struggle to achieve both precise spatial localization and fine-grained, continuous control over specific object attributes. To address this, we propose a model-agnostic, disentangled control method that requires no architectural modification. We first show that transferable token-level semantic directions exist within CLIP text embeddings, and introduce two ways to identify them: a simple, optimization-free technique and a learning-based approach that uses the T2I model itself to characterize semantic concepts more specifically. By applying controlled perturbations in the prompt embedding space, our approach enables parallel, continuous intensity adjustment of multiple attributes of a single subject. This unified framework bridges the gap between global semantic control and local attribute localization, improving both the accuracy and flexibility of attribute manipulation while preserving subject identity. Code and an interactive demo are publicly available.
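The optimization-free variant can be sketched in a few lines: estimate a semantic direction as the mean difference between token-level embeddings from contrastive prompt pairs, then shift the subject token's embedding along that direction with a continuous scale. The sketch below uses random arrays as stand-ins for CLIP token embeddings; the function names, dimensions, and prompt examples are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def estimate_direction(pos_embs, neg_embs):
    """Optimization-free direction: mean difference between token embeddings
    drawn from contrastive prompt pairs (e.g. 'elderly man' vs 'young man'),
    normalized to unit length."""
    d = np.mean(np.asarray(pos_embs) - np.asarray(neg_embs), axis=0)
    return d / np.linalg.norm(d)

def apply_edit(token_emb, direction, alpha):
    """Shift a single subject token's embedding along the direction;
    alpha continuously scales the attribute's intensity (alpha=0 leaves
    the prompt embedding unchanged)."""
    return token_emb + alpha * direction

# Stand-in embeddings (dim 768 chosen to match CLIP's text width); in
# practice these would come from the frozen CLIP text encoder at the
# subject token's position.
rng = np.random.default_rng(0)
pos = rng.normal(size=(8, 768))   # e.g. embeddings from 'old <subject>' prompts
neg = rng.normal(size=(8, 768))   # matching 'young <subject>' prompts

d = estimate_direction(pos, neg)
edited = apply_edit(neg[0], d, alpha=1.5)
```

Because the edit acts only on the prompt embedding, several directions can be applied to different subject tokens in parallel without touching the diffusion model, which is what enables the multi-attribute, subject-specific control described above.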

📝 Abstract
Recent advances in text-to-image (T2I) diffusion models have significantly improved the quality of generated images. However, providing efficient control over individual subjects, particularly the attributes characterizing them, remains a key challenge. While existing methods have introduced mechanisms to modulate attribute expression, they typically provide either detailed, object-specific localization of such a modification or full-scale fine-grained, nuanced control of attributes. No current approach offers both simultaneously, resulting in a gap when trying to achieve precise continuous and subject-specific attribute modulation in image generation. In this work, we demonstrate that token-level directions exist within commonly used CLIP text embeddings that enable fine-grained, subject-specific control of high-level attributes in T2I models. We introduce two methods to identify these directions: a simple, optimization-free technique and a learning-based approach that utilizes the T2I model to characterize semantic concepts more specifically. Our methods allow the augmentation of the prompt text input, enabling fine-grained control over multiple attributes of individual subjects simultaneously, without requiring any modifications to the diffusion model itself. This approach offers a unified solution that fills the gap between global and localized control, providing competitive flexibility and precision in text-guided image generation. Project page: https://compvis.github.io/attribute-control. Code is available at https://github.com/CompVis/attribute-control.
Problem

Research questions and friction points this paper is trying to address.

Lack of precise continuous subject-specific attribute control in T2I models.
Existing methods fail to combine detailed localization and nuanced attribute control.
Need for unified solution enabling fine-grained control over multiple subject attributes.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level directions in CLIP embeddings
Optimization-free and learning-based semantic direction identification
Augmented prompt text for fine-grained attribute control