🤖 AI Summary
Existing text-to-image models struggle to generate multi-person interaction scenes that are both semantically diverse and compositionally accurate, often collapsing to repetitive layouts, stereotyped poses, and poorly grounded interactions. To address this, the work introduces a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers: the model jointly predicts a 2D pose visualization image and the corresponding RGB image, so that structure and appearance co-evolve during learning. A cross-modal alignment scheme binds text, pose, and image representations, and an iterative scene construction scheme generates complex multi-human interactions progressively, decomposing the overall generation complexity. The authors report improved prompt alignment and scene diversity in multi-person image generation.
📝 Abstract
Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.
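The iterative scene construction idea can be illustrated with a minimal control-flow sketch. It is not the authors' implementation: the abstract does not say how a scene is decomposed, so the sketch assumes one sub-prompt per person, and `SceneState`, `toy_joint_step`, and `build_scene` are hypothetical names. The toy step only produces labelled strings in place of a real pose visualization and RGB image, to show that both are produced together and conditioned on what has already been built.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneState:
    """Paired structure/appearance state that grows as the scene is built."""
    poses: List[str] = field(default_factory=list)   # stand-in for the 2D pose image
    pixels: List[str] = field(default_factory=list)  # stand-in for the RGB image


# Hypothetical signature: (current state, sub-prompt) -> (pose part, RGB part).
JointStep = Callable[[SceneState, str], Tuple[str, str]]


def toy_joint_step(state: SceneState, sub_prompt: str) -> Tuple[str, str]:
    """Placeholder for a joint pose+image prediction (assumption, not the real model).

    It returns both outputs from a single call, mirroring the idea that
    structure and appearance are predicted together.
    """
    idx = len(state.poses)  # position of the new person in the existing scene
    return f"pose[{idx}]:{sub_prompt}", f"rgb[{idx}]:{sub_prompt}"


def build_scene(sub_prompts: List[str], step: JointStep = toy_joint_step) -> SceneState:
    """Add one person per iteration, conditioning each step on the scene so far."""
    state = SceneState()
    for sub_prompt in sub_prompts:
        pose_part, rgb_part = step(state, sub_prompt)
        state.poses.append(pose_part)
        state.pixels.append(rgb_part)
    return state


if __name__ == "__main__":
    scene = build_scene(["a woman waving", "a man shaking her hand"])
    print(scene.poses)
    print(scene.pixels)
```

Splitting the prompt this way means each step handles one person rather than the full interaction at once, which is one plausible reading of "decomposing the overall generation complexity".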