🤖 AI Summary
Existing subject customization methods face two key bottlenecks: learning-based U-Net approaches suffer from poor generalization and degraded image fidelity, while optimization-based methods require subject-specific fine-tuning, compromising text controllability. This paper introduces InstantCharacter, a diffusion-transformer framework for open-domain character personalization that enables high-fidelity, strongly text-controllable synthesis across poses, appearances, and styles without any subject-specific fine-tuning. The core contributions are: (1) a scalable adapter built from stacked transformer encoders that injects open-domain character features into the latent space of a modern diffusion transformer; (2) a character dataset of roughly ten million samples, organized into paired multi-view images and unpaired text-image data; and (3) a dual-path training strategy that routes these two subsets through distinct learning pathways to jointly optimize identity consistency and text-driven editability. Qualitative experiments demonstrate high-fidelity, character-consistent generation with preserved text controllability, setting a new benchmark for character-driven image generation.
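To make the adapter design concrete, below is a minimal PyTorch sketch of a stacked-transformer-encoder adapter that refines reference-character features into tokens for the diffusion transformer's attention layers. All class names, dimensions, and the cross-attention injection noted in the comments are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AdapterBlock(nn.Module):
    """One encoder block: self-attention over character tokens plus an MLP."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class CharacterAdapter(nn.Module):
    """Stacks encoder blocks to turn frozen image-encoder patch features
    into character tokens consumed by the DiT's attention layers."""
    def __init__(self, feat_dim: int, dit_dim: int, depth: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, dit_dim)
        self.blocks = nn.ModuleList(AdapterBlock(dit_dim) for _ in range(depth))
        self.proj_out = nn.Linear(dit_dim, dit_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N, feat_dim) patch features of the reference character
        x = self.proj_in(image_feats)
        for block in self.blocks:
            x = block(x)
        # The returned tokens would then be fused into the DiT latent stream,
        # e.g. via an extra cross-attention: latents += attn(q=latents, k=v=tokens).
        return self.proj_out(x)
```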
📝 Abstract
Current learning-based subject customization approaches, which predominantly rely on U-Net architectures, suffer from limited generalization and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages. First, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter composed of stacked transformer encoders, which effectively processes open-domain character features and interacts seamlessly with the latent space of modern diffusion transformers. Third, to train the framework effectively, we construct a large-scale character dataset containing on the order of ten million samples, systematically organized into paired (multi-view character images) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at https://github.com/Tencent/InstantCharacter.
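The sketch below illustrates how the dual-data structure could route the paired and unpaired subsets through distinct objectives during training. It is a minimal sketch under stated assumptions: `encode_vae`, `add_noise`, `encode_image`, and the `dit` call are hypothetical stand-ins for the model's actual interfaces, and a plain denoising MSE loss is assumed rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def training_step(model, adapter, batch):
    if batch["is_paired"]:
        # Paired path: condition on one view of a character and denoise a
        # different view, so the adapter must carry identity across poses/styles.
        ref, target = batch["view_a"], batch["view_b"]
    else:
        # Unpaired path: condition on the captioned image itself, which
        # preserves the base model's text-following behavior.
        ref, target = batch["image"], batch["image"]

    latents = model.encode_vae(target)                       # hypothetical helper
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)  # random timesteps
    noisy = model.add_noise(latents, noise, t)               # hypothetical helper

    char_tokens = adapter(model.encode_image(ref))           # adapter from above
    pred = model.dit(noisy, t, text=batch["caption"], char_tokens=char_tokens)
    return F.mse_loss(pred, noise)                           # denoising objective
```

Both pathways update the same adapter, so each batch either reinforces identity consistency (paired) or text editability (unpaired) without a separate model per objective.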