MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-subject image generation faces challenges including low subject fidelity, severe attribute leakage, and insufficient alignment with human aesthetic preferences—largely due to coarse-grained modeling inherent in reconstruction-based objectives and contextual learning paradigms. To address these issues, we propose a decoupled spatial attention and identity-aware reinforcement learning framework. First, we introduce explicit positional supervision to achieve spatial disentanglement of attention regions across subjects. Second, we design a lightweight Mixture-of-Experts (MoE) architecture to enhance scene-adaptive generation capability. Third, we develop an online reinforcement learning module grounded in human preference scores, jointly optimizing reconstruction loss and aesthetic alignment objectives. Experiments demonstrate that our method significantly suppresses attribute leakage, substantially improves identity fidelity and semantic consistency in complex multi-subject scenes, and generates images better aligned with human aesthetic preferences.
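The spatial disentanglement idea above can be sketched as a loss that rewards each subject's attention mass inside its own target region and penalizes mass falling into other subjects' regions. This is a minimal illustration under assumed tensor shapes, not the paper's actual objective (`disentanglement_loss` and the inside/leak formulation are hypothetical):

```python
import numpy as np

def disentanglement_loss(attn_maps, subject_masks):
    # attn_maps:     (S, H, W) non-negative cross-attention maps, one per subject.
    # subject_masks: (S, H, W) binary target region per subject (positional supervision).
    S = attn_maps.shape[0]
    flat = attn_maps.reshape(S, -1)
    probs = (flat / flat.sum(axis=1, keepdims=True)).reshape(attn_maps.shape)
    # Attention mass each subject places inside its own region...
    inside = (probs * subject_masks).reshape(S, -1).sum(axis=1)
    # ...and mass leaking into the union of the other subjects' regions.
    others = np.clip(subject_masks.sum(axis=0, keepdims=True) - subject_masks, 0.0, 1.0)
    leak = (probs * others).reshape(S, -1).sum(axis=1)
    # Minimizing (leak - inside) pushes attention maps apart spatially.
    return float((leak - inside).mean())
```

With attention concentrated entirely inside each subject's own mask the loss reaches its minimum of -1; with the maps swapped onto the wrong subjects it reaches +1.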

📝 Abstract
Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, are limited by their reliance on simple reconstruction-based objectives, leading both to severe attribute leakage that compromises subject fidelity and to poor alignment with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is a significant entanglement of attention between different subjects during the generation process. Therefore, we introduce positional supervision to explicitly separate attention regions for each subject, effectively mitigating attribute leakage. To enable the model to accurately plan the attention regions of different subjects in diverse scenarios, we employ a Mixture-of-Experts architecture to enhance the model's capacity, allowing different experts to focus on different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism that accurately assesses multi-subject fidelity and a more stable training strategy tailored to the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.
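The Mixture-of-Experts capacity boost described in the abstract amounts to routing each token through a small subset of expert networks selected by a learned gate. A generic top-k routing sketch follows; all names and shapes are assumed, and the paper's lightweight MoE design is not specified here:

```python
import numpy as np

def moe_route(x, gate_w, expert_ws, top_k=1):
    # x:         (d,)     token feature.
    # gate_w:    (E, d)   router weights producing one logit per expert.
    # expert_ws: (E, d, d) one linear expert per scenario specialization.
    logits = gate_w @ x
    top = np.argsort(logits)[-top_k:]                  # indices of the k best experts
    sel = np.exp(logits[top] - logits[top].max())
    gates = sel / sel.sum()                            # softmax over selected experts only
    # Output is the gate-weighted mixture of the selected experts' outputs.
    return sum(g * (expert_ws[e] @ x) for g, e in zip(gates, top))
```

With top_k=1 this reduces to picking the single highest-scoring expert, so different experts can specialize in planning attention regions for different scene layouts.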
Problem

Research questions and friction points this paper is trying to address.

Preventing attribute leakage between subjects
Separating attention regions with positional supervision
Aligning generation with human aesthetic preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatially disentangled attention prevents attribute leakage
Mixture-of-Experts enhances scenario planning capacity
Identity-aware reinforcement learning aligns with human preferences
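The identity-aware reinforcement learning contribution combines a multi-subject fidelity score with an aesthetic preference score into a single reward, then computes batch-relative advantages for a stable online update. The sketch below is a plausible reading of that recipe, not the paper's implementation; `preference_reward`, the min-over-subjects aggregation, and the weight `w_id` are all assumptions:

```python
import numpy as np

def preference_reward(id_scores, aesthetic_score, w_id=0.5):
    # id_scores: per-subject identity-fidelity scores in [0, 1]
    # (e.g., similarity of each generated subject to its reference image).
    # Taking the min penalizes samples where any single subject degrades.
    fidelity = min(id_scores)
    return w_id * fidelity + (1.0 - w_id) * aesthetic_score

def advantage_weights(rewards):
    # Group-relative advantages: center and scale rewards within a batch of
    # samples, a common choice for stable online preference optimization.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

Samples whose reward exceeds the batch mean receive positive advantage weights, so the policy update is pushed toward generations that preserve every subject's identity while scoring well aesthetically.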