🤖 AI Summary
Current text-to-image models handle multi-person prompts poorly: they duplicate identities, blend faces, and generate the wrong number of people. To address this, the authors propose DisCo, which they describe as the first reinforcement learning framework that directly optimizes identity diversity in multi-human generation. DisCo fine-tunes flow-matching models with Group-Relative Policy Optimization (GRPO) under a compositional reward with four terms: an intra-image facial similarity penalty, cross-sample identity suppression, person-count accuracy, and a human preference score. A single-stage curriculum stabilizes training as prompt complexity grows, and the method requires no extra annotations. On the DiverseHumans Testset, DisCo reaches a Unique Face Accuracy of 98.6 and near-perfect Global Identity Spread, outperforming open-source and proprietary baselines while keeping perceptual quality competitive.
📝 Abstract
State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts: they duplicate faces, merge identities, and miscount individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread, surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.
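To make the four-part reward and the group-relative update concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact form or weighting of each reward term, so the functions `intra_image_diversity`, `count_accuracy`, and `composite_reward`, the equal default weights, and the use of cosine similarity over face embeddings are all hypothetical choices. The cross-sample diversity and human preference scores are passed in as plain numbers because their computation is not specified. The only standard piece is `group_relative_advantages`, which normalizes rewards within a group of samples for the same prompt, as GRPO commonly does.

```python
import numpy as np


def intra_image_diversity(face_embs):
    """Assumed term (i): 1 minus the highest pairwise cosine similarity
    among the face embeddings detected in one image (nonzero vectors)."""
    n = len(face_embs)
    if n < 2:
        return 1.0  # a single face cannot duplicate another
    e = np.asarray(face_embs, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sim = e @ e.T
    off_diag = sim[~np.eye(n, dtype=bool)]  # ignore self-similarity
    return float(1.0 - off_diag.max())


def count_accuracy(num_detected, num_requested):
    """Assumed term (iii): 1 if the detected face count matches the prompt."""
    return 1.0 if num_detected == num_requested else 0.0


def composite_reward(face_embs, num_requested, cross_sample_diversity,
                     preference_score, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights are hypothetical."""
    terms = np.array([
        intra_image_diversity(face_embs),                # (i)
        cross_sample_diversity,                          # (ii) supplied by caller
        count_accuracy(len(face_embs), num_requested),   # (iii)
        preference_score,                                # (iv) supplied by caller
    ])
    return float(np.dot(weights, terms))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

In use, one would score every image sampled for a prompt with `composite_reward`, then pass the resulting list to `group_relative_advantages`; samples that score above their group's mean receive positive advantages and are reinforced during fine-tuning.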