XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly achieving fine-grained control over identity and semantic attributes (e.g., pose, style, illumination) in multi-subject text-to-image generation. We propose a reference-image-guided, token-level text-stream modulation method for DiT-based architectures. Specifically, a lightweight image-to-offset mapping network generates reference-driven modulation offsets for each text token, enabling disentangled modeling and independent controllability of identity and semantic attributes. Compared to existing approaches, our method significantly alleviates attribute entanglement and editing artifacts, thereby improving generation fidelity, cross-subject consistency, and editability. Extensive experiments demonstrate superior personalized control and synthesis quality—particularly in complex multi-subject scenarios—while maintaining computational efficiency and architectural compatibility with diffusion transformer backbones.
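The mechanism described above — a lightweight mapper that turns a reference-image embedding into per-token scale/shift offsets applied to the text stream of a DiT block — can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the MLP mapper, the `modulate` signature, and all dimensions are hypothetical, and the DiT-style `x * (1 + scale) + shift` modulation form is assumed from standard diffusion-transformer practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # Two-layer MLP standing in for the lightweight image-to-offset
    # mapping network (hypothetical architecture).
    h = np.tanh(x @ w1 + b1)
    return h @ w2 + b2

def modulate(text_tokens, ref_embedding, params):
    """Token-level text-stream modulation (sketch).

    text_tokens   : (T, d) text token features entering a DiT block
    ref_embedding : (r,)  embedding of the reference image, e.g. from a
                    frozen image encoder (name illustrative)
    params        : weights of the offset-mapping MLP

    Each token is concatenated with the reference embedding so the
    mapper can predict a token-specific (scale, shift) pair; the tokens
    are then modulated as x * (1 + scale) + shift, leaving the image
    latents untouched.
    """
    T, d = text_tokens.shape
    r = ref_embedding.shape[0]
    inp = np.concatenate(
        [text_tokens, np.broadcast_to(ref_embedding, (T, r))], axis=1)
    offsets = mlp(inp, *params)               # (T, 2d)
    scale, shift = offsets[:, :d], offsets[:, d:]
    return text_tokens * (1.0 + scale) + shift
```

Because the offsets enter only through the text stream, zeroing the mapper's output layer recovers the unmodified tokens — consistent with the paper's claim that control is added without disrupting image latents or features.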

📝 Abstract
Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose XVerse, a novel multi-subject controlled generation model. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows precise and independent control over specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
Problem

Research questions and friction points this paper is trying to address.

Control subject identity and attributes in multi-subject image generation
Overcome artifacts and attribute entanglement in Diffusion Transformers
Achieve high-fidelity editable multi-subject synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT modulation for multi-subject control
Token-specific text-stream offsets
High-fidelity editable image synthesis
Bowen Chen
Intelligent Creation Team, ByteDance
Mengyi Zhao
Beihang University
Computer Vision · Artificial Intelligence
Haomiao Sun
Intelligent Creation Team, ByteDance
Li Chen
Intelligent Creation Team, ByteDance
Xu Wang
Intelligent Creation Team, ByteDance
Kang Du
University of Utah
Causal Inference · Domain Generalization
Xinglong Wu
Algorithm Engineer, ByteDance
Artificial Intelligence