🤖 AI Summary
Existing methods struggle to simultaneously satisfy structural layout fidelity and fine-grained semantic/visual control in human image generation, which particularly limits fashion design applications. To address this, we propose the first controllable human image generation framework tailored for hand-drawn layouts. Our method introduces a decoupled multimodal conditional control mechanism that enables independent textual or reference-image constraints for each body part; a lightweight color-blocked geometric sketch representation with an accompanying vectorization-based preprocessing pipeline; and ComposeHuman, the first dataset with part-level disentangled text and image annotations. Built on diffusion models, our architecture integrates a layout encoder, a conditional adaptation module, and a disentangled attention mechanism. Experiments demonstrate significant improvements over baselines in layout fidelity, text alignment, and reference-image similarity. The framework supports flexible editing, cross-modal composition, and zero-shot part replacement, achieving robust multi-task controllability.
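To make the decoupled per-part conditioning concrete, here is a minimal sketch of how such a multimodal specification could be represented. The `PartCondition` structure, field names, part vocabulary, and the `generate` call are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartCondition:
    """Condition for one body part: a text prompt, a reference image, or both."""
    part: str                        # e.g. "hair", "top", "pants", "shoes"
    text: Optional[str] = None       # textual constraint for this part
    ref_image: Optional[str] = None  # path to a reference image for this part

# Each part is constrained independently, mixing modalities freely.
conditions = [
    PartCondition(part="hair",  text="short wavy blonde hair"),
    PartCondition(part="top",   ref_image="refs/denim_jacket.jpg"),
    PartCondition(part="pants", text="pleated white skirt"),
    PartCondition(part="shoes", ref_image="refs/red_sneakers.jpg"),
]

# A generation call would then pair these conditions with a hand-drawn layout,
# e.g. generate(layout="layout.png", conditions=conditions)  # hypothetical API
```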
📝 Abstract
Building on the success of diffusion models, multimodal image generation tasks have seen significant advancements. Among these, human image generation has emerged as a promising technique, offering the potential to revolutionize the fashion design process. However, existing methods often focus solely on text-to-image or reference-image-based human generation, which fails to satisfy increasingly sophisticated demands. To address the limitations of flexibility and precision in human generation, we introduce ComposeAnyone, a controllable layout-to-human generation method with decoupled multimodal conditions. Specifically, our method allows decoupled control of any part in hand-drawn human layouts using text or reference images, seamlessly integrating them during the generation process. The hand-drawn layout, composed of color-blocked geometric shapes such as ellipses and rectangles, is easy to draw and offers a more flexible and accessible way to define spatial layouts. Additionally, we introduce the ComposeHuman dataset, which provides decoupled text and reference-image annotations for the different components of each human image, enabling broader applications in human image generation tasks. Extensive experiments on multiple datasets demonstrate that ComposeAnyone generates human images with better alignment to given layouts, text descriptions, and reference images, showcasing its multi-task capability and controllability.
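As an illustration of the color-blocked layout format described above, the sketch below composes a rough human layout from simple ellipses and rectangles with Pillow. The part names, colors, coordinates, and canvas size are hypothetical choices for demonstration, not the paper's actual color scheme or preprocessing pipeline.

```python
from PIL import Image, ImageDraw

# Hypothetical color coding: one flat color per body part / garment region.
PART_COLORS = {
    "hair":  (140, 90, 40),
    "face":  (245, 205, 170),
    "top":   (200, 40, 40),
    "pants": (40, 60, 160),
    "shoes": (30, 30, 30),
}

def draw_layout(size=(384, 512)):
    """Compose a coarse human layout from color-blocked ellipses and rectangles."""
    canvas = Image.new("RGB", size, (255, 255, 255))
    draw = ImageDraw.Draw(canvas)

    # Head: two stacked ellipses for hair and face.
    draw.ellipse([152, 40, 232, 130], fill=PART_COLORS["hair"])
    draw.ellipse([162, 70, 222, 140], fill=PART_COLORS["face"])

    # Torso and legs: simple rectangles roughly aligned under the head.
    draw.rectangle([140, 140, 244, 300], fill=PART_COLORS["top"])
    draw.rectangle([150, 300, 234, 440], fill=PART_COLORS["pants"])

    # Feet: small ellipses at the bottom.
    draw.ellipse([146, 435, 188, 465], fill=PART_COLORS["shoes"])
    draw.ellipse([196, 435, 238, 465], fill=PART_COLORS["shoes"])
    return canvas

if __name__ == "__main__":
    draw_layout().save("hand_drawn_layout.png")
```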