🤖 AI Summary
This work addresses the challenge of jointly achieving fine-grained control over identity and semantic attributes (e.g., pose, style, illumination) in multi-subject text-to-image generation. We propose XVerse, a reference-image-guided, token-level text-stream modulation method for DiT-based architectures. Specifically, a lightweight image-to-offset mapping network generates reference-driven modulation offsets for each text token, enabling disentangled modeling and independent control of identity and semantic attributes. Compared with existing approaches, our method significantly alleviates attribute entanglement and editing artifacts, thereby improving generation fidelity, cross-subject consistency, and editability. Extensive experiments demonstrate superior personalized control and synthesis quality, particularly in complex multi-subject scenarios, while maintaining computational efficiency and architectural compatibility with diffusion transformer backbones.
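To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea as described above: a lightweight mapping network turns reference-image features into per-token scale/shift offsets that are folded into the AdaLN-style modulation of a DiT's text stream, leaving the image latents untouched. All names, shapes, and the cross-attention pooling step (`TokenOffsetMapper`, `modulate_text_stream`, `subject_mask`) are illustrative assumptions, not the released XVerse implementation.

```python
import torch
import torch.nn as nn


class TokenOffsetMapper(nn.Module):
    """Illustrative sketch (not the paper's code): maps reference-image
    features to per-token modulation offsets for a DiT text stream."""

    def __init__(self, img_dim: int, txt_dim: int, n_heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.Linear(img_dim, txt_dim)
        # Each text token attends to reference-image patches to pool
        # the identity cues relevant to that token (assumed design).
        self.attn = nn.MultiheadAttention(txt_dim, n_heads, batch_first=True)
        # Predict additive offsets to the block's (scale, shift) modulation.
        self.to_offsets = nn.Linear(txt_dim, 2 * txt_dim)
        nn.init.zeros_(self.to_offsets.weight)  # start as a no-op
        nn.init.zeros_(self.to_offsets.bias)

    def forward(self, txt_tokens, ref_feats, subject_mask):
        # txt_tokens:   (B, T, txt_dim) text-stream token embeddings
        # ref_feats:    (B, R, img_dim) reference-image patch features
        # subject_mask: (B, T), 1 where a token refers to the subject, else 0
        kv = self.img_to_txt(ref_feats)
        pooled, _ = self.attn(txt_tokens, kv, kv)           # (B, T, txt_dim)
        d_scale, d_shift = self.to_offsets(pooled).chunk(2, dim=-1)
        mask = subject_mask.unsqueeze(-1)  # restrict offsets to subject tokens
        return d_scale * mask, d_shift * mask


def modulate_text_stream(txt_tokens, scale, shift, d_scale, d_shift, norm):
    # AdaLN-style modulation with the reference-driven offsets folded in;
    # only the text stream is touched, image latents/features stay intact.
    return norm(txt_tokens) * (1 + scale + d_scale) + (shift + d_shift)


if __name__ == "__main__":
    B, T, R = 2, 77, 256
    mapper = TokenOffsetMapper(img_dim=1024, txt_dim=768)
    txt = torch.randn(B, T, 768)
    ref = torch.randn(B, R, 1024)
    mask = torch.zeros(B, T)
    mask[:, 3:6] = 1.0  # tokens naming the subject (toy example)
    d_scale, d_shift = mapper(txt, ref, mask)
    out = modulate_text_stream(
        txt, torch.zeros_like(txt), torch.zeros_like(txt),
        d_scale, d_shift, nn.LayerNorm(768),
    )
    print(out.shape)  # torch.Size([2, 77, 768])
```

Zero-initializing the offset head so the mapper starts as a no-op is a common trick for adapters on pretrained backbones; whether XVerse does the same is an assumption here.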
📝 Abstract
Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose XVerse, a novel multi-subject controlled generation model. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows precise and independent control of individual subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.