🤖 AI Summary
This work addresses the challenge of achieving high-fidelity personalized image generation with diffusion transformers (DiTs) in a zero-shot, training-free setting. Methodologically, we propose a fine-tuning-free personalization paradigm that integrates timestep-adaptive token replacement and structured patch perturbation, dynamically injecting user-specified concept features during the DiT denoising process. This enables zero-shot subject reconstruction, layout-guided generation, multi-subject customization, and mask-driven editing, all without architectural modification or additional training. To our knowledge, this is the first approach to jointly achieve high-fidelity identity preservation and strong generalization within the DiT framework under strictly zero-shot conditions. Extensive experiments demonstrate state-of-the-art performance in identity consistency, editing flexibility, and cross-scenario compatibility, while significantly reducing computational overhead. Crucially, our method is fully compatible with existing pretrained DiT models as-is.
📝 Abstract
Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibiting higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose **Personalize Anything**, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.
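The core mechanism described above (hard-replace the subject's denoising tokens in early timesteps to lock in identity, then stop injecting in late timesteps so the model can harmonize the subject with its new context) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: `dit`, `ref_tokens_per_step`, `subject_mask`, and the cutoff `tau` are hypothetical names, the paper's late-stage regularization is simplified here to "no injection," and the patch perturbation strategy is omitted.

```python
import torch

def denoise_with_token_replacement(
    dit,                  # hypothetical DiT denoising step: (tokens, t) -> updated tokens
    noisy_tokens,         # [N, D] latent tokens of the image being generated
    ref_tokens_per_step,  # hypothetical cache: timestep -> [N, D] reference-subject tokens
    subject_mask,         # [N] bool mask of token positions covered by the subject
    timesteps,            # descending diffusion timesteps
    tau=0.7,              # assumed fraction of early steps that use hard replacement
):
    """Sketch of timestep-adaptive token replacement (hedged, illustrative only).

    Early steps: overwrite subject-position tokens with cached reference tokens,
    enforcing subject consistency. Late steps: denoise freely, allowing the
    subject to blend with the surrounding layout and prompt context.
    """
    x = noisy_tokens
    n_inject = int(tau * len(timesteps))
    for i, t in enumerate(timesteps):
        if i < n_inject:
            # Early-stage injection: replace subject tokens with the reference
            # subject's tokens cached at the same noise level.
            ref = ref_tokens_per_step[int(t)]
            x = torch.where(subject_mask[:, None], ref, x)
        # One denoising step on the (possibly injected) token sequence.
        x = dit(x, t)
    return x
```

Because the injection happens purely on the token sequence at inference time, nothing in this sketch touches the model's weights or architecture, which is what makes the approach training-free and compatible with off-the-shelf DiTs.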