🤖 AI Summary
Existing text-to-image models frequently miscount people, confuse identities, and duplicate faces when generating multi-person scenes. To address these failures, we propose Ar2Can, a two-stage framework that decouples spatial layout prediction from identity-aware rendering: the first stage (the Architect) generates structured human layouts specifying position, orientation, and count, while the second stage (the Artist) performs fine-grained identity-preserving rendering guided by a spatially aligned face-matching reward that combines ArcFace-based identity similarity with Hungarian-algorithm assignment. Training relies primarily on synthetic data and applies Group Relative Policy Optimization (GRPO) to jointly optimize multiple reward objectives. Evaluated on MultiHuman-Testbench, Ar2Can achieves substantial improvements in person-count accuracy (+23.6%) and identity preservation (ID-Sim +0.18) while maintaining high visual fidelity, demonstrating reliable, high-fidelity multi-person image synthesis without real multi-human training images.
📝 Abstract
Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially grounded face-matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures that faces are rendered at the correct locations and faithfully preserve reference identities. We develop two Architect variants, integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
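To make the reward mechanism concrete, here is a minimal sketch of a spatially grounded face-matching reward in the spirit described above: an assignment cost built from identity similarity (ArcFace-style L2-normalized embeddings) and the distance between planned and rendered face positions, solved optimally with the Hungarian algorithm. The function name, the `w_spatial` weight, and the specific cost combination are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm


def face_matching_reward(ref_embeds, gen_embeds, ref_pos, gen_pos, w_spatial=0.5):
    """Hypothetical sketch of a spatially grounded face-matching reward.

    ref_embeds / gen_embeds: (N, D) / (N, D) L2-normalized identity embeddings
        (e.g. from an ArcFace encoder) for reference and generated faces.
    ref_pos / gen_pos: (N, 2) face centers in normalized [0, 1] image coords
        (planned layout positions vs. detected positions in the output).
    Returns the mean identity similarity over the optimal one-to-one matching.
    """
    # Cosine similarity between every reference/generated face pair.
    id_sim = ref_embeds @ gen_embeds.T                                  # (N, N)
    # Spatial distance between where each face was planned and rendered.
    dist = np.linalg.norm(ref_pos[:, None] - gen_pos[None], axis=-1)   # (N, N)
    # Combined cost: low when the identity matches AND the face sits
    # where the layout placed it; w_spatial trades off the two terms.
    cost = -id_sim + w_spatial * dist
    rows, cols = linear_sum_assignment(cost)
    # Reward: identity similarity averaged over the matched pairs.
    return float(id_sim[rows, cols].mean())
```

In this sketch the spatial term only steers the assignment, so each generated face is scored against the reference identity intended for its location, which is what penalizes duplicated or swapped faces.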