ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-fidelity disentangled control of identity, hairstyle, clothing, and other attributes remains challenging in personalized human image generation. Method: This paper proposes Attribute-Specific Visual Prompts (ASVP), a paradigm that encodes multiple reference images into separate, attribute-specific tokens (e.g., for hairstyle, clothing, and identity) and injects them into a pre-trained text-to-image diffusion model. It further introduces a cross-reference joint training strategy and a dedicated dataset to enable multi-attribute disentangled editing and natural composition, extending seamlessly to multi-person scenarios. Contribution/Results: Experiments demonstrate state-of-the-art performance in both textual and visual prompt adherence. The method significantly improves accuracy and visual naturalness in fine-grained controllable generation, enabling precise, independent manipulation of semantic attributes while preserving global coherence and realism.
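Concretely, the ASVP idea can be pictured as one small encoder per attribute that distills that attribute's reference images into a fixed set of prompt tokens. The sketch below is an assumption-laden illustration, not the paper's released code: the Perceiver-style query attention, the token counts and dimensions, and names such as `AttributeTokenizer` and `ASVPEncoder` are all hypothetical.

```python
# Minimal sketch of attribute-specific visual prompts (ASVP).
# All architectural choices here are assumptions for illustration.
import torch
import torch.nn as nn

class AttributeTokenizer(nn.Module):
    """Distills reference-image features for one attribute (e.g. 'hair')
    into a fixed number of prompt tokens via learned query attention."""
    def __init__(self, feat_dim=1024, token_dim=768, num_tokens=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.proj = nn.Linear(feat_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)

    def forward(self, image_feats):
        # image_feats: (B, N_patches, feat_dim) from a frozen image encoder
        kv = self.proj(image_feats)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)   # (B, num_tokens, token_dim)
        return tokens

class ASVPEncoder(nn.Module):
    """One tokenizer per attribute; outputs a dict of attribute-specific tokens."""
    def __init__(self, attributes=("identity", "hair", "clothing")):
        super().__init__()
        self.tokenizers = nn.ModuleDict({a: AttributeTokenizer() for a in attributes})

    def forward(self, feats_by_attr):
        # feats_by_attr: {attr_name: (B, N_patches, feat_dim)}
        return {a: self.tokenizers[a](f) for a, f in feats_by_attr.items()}
```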

📝 Abstract
Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: https://snap-research.github.io/composeme/.
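One plausible reading of "injected into a pre-trained text-to-image diffusion model" is that the attribute tokens are concatenated with the text-encoder sequence that every cross-attention layer attends to, which also extends naturally to several people in one image. The sketch below assumes exactly that; `build_conditioning` and the fixed attribute ordering are hypothetical, not the paper's implementation.

```python
# Sketch: merge text tokens and per-person attribute tokens into the single
# conditioning sequence consumed by the denoiser's cross-attention layers.
# The concatenation scheme is an assumption for illustration.
import torch

def build_conditioning(text_tokens, attr_tokens_per_person):
    """text_tokens: (B, L, D) from the frozen text encoder.
    attr_tokens_per_person: list of dicts, one per person,
        each mapping attribute name -> (B, T, D) tokens."""
    streams = [text_tokens]
    for person in attr_tokens_per_person:
        for attr in ("identity", "hair", "clothing"):
            if attr in person:
                streams.append(person[attr])
    # One sequence of shape (B, L + total_attr_tokens, D) for cross-attention
    return torch.cat(streams, dim=1)

# Hypothetical usage with a standard diffusion denoiser:
# noise_pred = unet(latents, t, encoder_hidden_states=build_conditioning(...))
```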
Problem

Research questions and friction points this paper is trying to address.

Achieving fine-grained control over human attributes in image generation
Overcoming limitations in modularity and disentangled attribute control
Enabling compositional control across multiple people in single images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attribute-specific image prompting for modular control
Encoding reference images into attribute-specific diffusion tokens
Multi-attribute cross-reference training for robust disentanglement (see the sketch after this list)
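The cross-reference idea in the last bullet can be illustrated with a data-sampling sketch: each attribute reference is drawn from a different photo of the same subject than the denoising target, so reproducing the target forces the model to disentangle the attribute rather than copy the reference's pose or layout. The function below is a hypothetical illustration of that pairing, not the paper's dataset pipeline.

```python
# Sketch of cross-reference pair construction. Dataset layout and field
# names are assumptions, not the curated dataset described in the paper.
import random

def sample_cross_reference(subject_photos, attributes=("identity", "hair", "clothing")):
    """subject_photos: photos of one subject in varied poses/expressions."""
    target = random.choice(subject_photos)
    refs = {}
    for attr in attributes:
        # Prefer a photo other than the target so references are misaligned
        # with it by construction; fall back if only one photo exists.
        pool = [p for p in subject_photos if p is not target] or subject_photos
        refs[attr] = random.choice(pool)
    # Training step: denoise `target` conditioned on `refs` plus the caption.
    return target, refs
```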