🤖 AI Summary
This work addresses the challenge of jointly achieving fine-grained control over identity and semantic attributes (e.g., pose, style, illumination) in multi-subject text-to-image generation. We propose XVerse, a reference-image-guided, token-level text-stream modulation method for DiT-based architectures. Specifically, a lightweight image-to-offset mapping network generates reference-driven modulation offsets for each text token, enabling disentangled modeling and independent control of identity and semantic attributes. Compared with existing approaches, our method significantly alleviates attribute entanglement and editing artifacts, thereby improving generation fidelity, cross-subject consistency, and editability. Extensive experiments demonstrate superior personalized control and synthesis quality, particularly in complex multi-subject scenarios, while maintaining computational efficiency and architectural compatibility with diffusion transformer backbones.
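To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea as described above: a lightweight mapping network turns reference-image features into per-token scale/shift offsets that are folded into the AdaLN-style modulation of a DiT's text stream, leaving the image latents untouched. All names, shapes, and the cross-attention pooling step (`TokenOffsetMapper`, `modulate_text_stream`, `subject_mask`) are illustrative assumptions, not the released XVerse implementation.

```python
import torch
import torch.nn as nn


class TokenOffsetMapper(nn.Module):
    """Illustrative sketch (not the paper's code): maps reference-image
    features to per-token modulation offsets for a DiT text stream."""

    def __init__(self, img_dim: int, txt_dim: int, n_heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.Linear(img_dim, txt_dim)
        # Each text token attends to reference-image patches to pool
        # the identity cues relevant to that token (assumed design).
        self.attn = nn.MultiheadAttention(txt_dim, n_heads, batch_first=True)
        # Predict additive offsets to the block's (scale, shift) modulation.
        self.to_offsets = nn.Linear(txt_dim, 2 * txt_dim)
        nn.init.zeros_(self.to_offsets.weight)  # start as a no-op
        nn.init.zeros_(self.to_offsets.bias)

    def forward(self, txt_tokens, ref_feats, subject_mask):
        # txt_tokens:   (B, T, txt_dim) text-stream token embeddings
        # ref_feats:    (B, R, img_dim) reference-image patch features
        # subject_mask: (B, T), 1 where a token refers to the subject, else 0
        kv = self.img_to_txt(ref_feats)
        pooled, _ = self.attn(txt_tokens, kv, kv)           # (B, T, txt_dim)
        d_scale, d_shift = self.to_offsets(pooled).chunk(2, dim=-1)
        mask = subject_mask.unsqueeze(-1)  # restrict offsets to subject tokens
        return d_scale * mask, d_shift * mask


def modulate_text_stream(txt_tokens, scale, shift, d_scale, d_shift, norm):
    # AdaLN-style modulation with the reference-driven offsets folded in;
    # only the text stream is touched, image latents/features stay intact.
    return norm(txt_tokens) * (1 + scale + d_scale) + (shift + d_shift)


if __name__ == "__main__":
    B, T, R = 2, 77, 256
    mapper = TokenOffsetMapper(img_dim=1024, txt_dim=768)
    txt = torch.randn(B, T, 768)
    ref = torch.randn(B, R, 1024)
    mask = torch.zeros(B, T)
    mask[:, 3:6] = 1.0  # tokens naming the subject (toy example)
    d_scale, d_shift = mapper(txt, ref, mask)
    out = modulate_text_stream(
        txt, torch.zeros_like(txt), torch.zeros_like(txt),
        d_scale, d_shift, nn.LayerNorm(768),
    )
    print(out.shape)  # torch.Size([2, 77, 768])
```

Zero-initializing the offset head so the mapper starts as a no-op is a common trick for adapters on pretrained backbones; whether XVerse does the same is an assumption here.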
📝 Abstract
Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose XVerse, a novel multi-subject controlled generation model. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows precise and independent control of individual subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.