🤖 AI Summary
To address scene degradation, weak controllability, and low identity fidelity in text-to-image diffusion models for generating realistic human portraits, this paper proposes CustomEnhancer, a zero-shot, fine-tuning-free framework for personalized portrait enhancement. Its core comprises a triple-flow fused PerGeneration architecture for generative modeling and ResInversion, an inversion technique built on a pre-diffusion mechanism that unifies the generation and reconstruction processes. ResInversion accelerates inversion by 129× over null-text inversion (NTI), drastically reducing computational overhead, while the triple-flow fusion enables precise training-free control and high-fidelity identity reconstruction. Experiments demonstrate state-of-the-art performance in scene diversity, identity fidelity, and control flexibility, addressing key bottlenecks of existing customized image generation, namely scene degradation and insufficient controllability.
📝 Abstract
Recently, remarkable progress has been made in synthesizing realistic human photos with text-to-image diffusion models. However, current approaches suffer from degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework that augments existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face-swapping techniques and a pretrained diffusion model to obtain additional representations, in a zero-shot manner, for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precisely controlled personalization and eliminating the need to retrain a controller per model. In addition, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing inversion time by 129×. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of ResInversion over NTI. The code will be made publicly available upon paper acceptance.
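For readers unfamiliar with the inversion step the abstract refers to: both NTI and the proposed ResInversion start from the idea of running a deterministic (DDIM-style) sampler backwards, mapping an image to a latent noise code that regenerates it. NTI is slow because it additionally optimizes null-text embeddings at every timestep; the sketch below shows only the underlying deterministic inversion loop on a scalar toy "model", not the authors' ResInversion or any real diffusion network. The schedule, the `eps_model` stand-in, and all constants are illustrative assumptions.

```python
import math

T = 50
# Toy monotone alpha-bar noise schedule in (0, 1]; abar[0] = 1.0 (clean image).
abar = [1.0 - 0.9 * (t / T) for t in range(T + 1)]

def eps_model(x, t):
    # Stand-in for a pretrained network's noise prediction. It depends only
    # on t here so the toy inversion is exactly reversible; a real model's
    # x-dependence makes plain DDIM inversion only approximately invertible,
    # which is the gap NTI (and, per the paper, ResInversion) corrects.
    return math.sin(0.3 * t)

def ddim_invert(x0):
    # Deterministically "noise" x0 step by step up to a latent x_T.
    x = x0
    for t in range(T):
        e = eps_model(x, t)
        x0_pred = (x - math.sqrt(1 - abar[t]) * e) / math.sqrt(abar[t])
        x = math.sqrt(abar[t + 1]) * x0_pred + math.sqrt(1 - abar[t + 1]) * e
    return x

def ddim_sample(xT):
    # Denoise x_T back down to an image (generation / reconstruction).
    x = xT
    for t in range(T, 0, -1):
        e = eps_model(x, t - 1)  # same predictions the inversion used
        x0_pred = (x - math.sqrt(1 - abar[t]) * e) / math.sqrt(abar[t])
        x = math.sqrt(abar[t - 1]) * x0_pred + math.sqrt(1 - abar[t - 1]) * e
    return x

x0 = 0.7
xT = ddim_invert(x0)
x0_rec = ddim_sample(xT)
print(abs(x0_rec - x0))  # near-zero reconstruction error
```

Because each sampling step here exactly undoes the corresponding inversion step, reconstruction is lossless in this toy; the cost NTI pays (a per-step optimization) and the 129× speedup ResInversion claims both live in how that correction is done for a real, x-dependent model.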