CustomEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address scene degradation, weak controllability, and low identity fidelity in text-to-image diffusion models for realistic human portraits, this paper proposes CustomEnhancer, a zero-shot, fine-tuning-free framework for enhancing personalized portrait generation. Its core comprises a triple-flow fused PerGeneration architecture that unifies generation and reconstruction, and ResInversion, an inversion technique that rectifies noise via a pre-diffusion mechanism. ResInversion accelerates inversion by 129× over null-text inversion (NTI), drastically reducing computational overhead, while the triple-flow fusion enables precise, training-free control over personalized models and high-fidelity identity reconstruction. Experiments demonstrate state-of-the-art performance in scene diversity, identity consistency, and control flexibility, addressing the key bottlenecks of existing customized image generation, namely scene degradation and insufficient controllability.

📝 Abstract
Recently, remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches suffer from degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face swapping techniques and a pretrained diffusion model to obtain additional representations in a zero-shot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precise controlled personalization and eliminating the need for per-model controller retraining. Furthermore, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing inversion time by a factor of 129. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of ResInversion over NTI. The code will be made publicly available upon paper acceptance.
Problem

Research questions and friction points this paper is trying to address.

Enhancing scene diversity and controllability in photo customization models
Addressing degraded scenes and suboptimal identity fidelity in personalized generation
Reducing high time complexity of existing inversion methods like NTI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot enhancement via face swapping and a pretrained diffusion model
Triple-flow fused PerGeneration unifies generation and reconstruction
ResInversion reduces inversion time by 129× versus NTI
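For context on the inversion bottleneck above: the paper's ResInversion implementation is not given on this page, but both NTI and ResInversion build on deterministic DDIM inversion, which walks a clean latent back to noise by running the sampler's update in reverse. Below is a minimal sketch of that baseline loop; `eps_model` and `alpha_bars` are illustrative stand-ins for the diffusion model's noise predictor and cumulative noise schedule, not the paper's code.

```python
import numpy as np

def ddim_invert(x0, eps_model, alpha_bars):
    """Deterministic DDIM inversion: map a clean latent x0 back to noise.

    eps_model(x, t) stands in for the diffusion model's noise prediction;
    alpha_bars is the cumulative schedule (1.0 at t=0, decreasing with t).
    """
    x = np.asarray(x0, dtype=float)
    for t in range(len(alpha_bars) - 1):
        a_t, a_next = alpha_bars[t], alpha_bars[t + 1]
        eps = eps_model(x, t)
        # Predict the clean sample implied by the current latent.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Step "forward" in noise level along the deterministic trajectory.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x
```

NTI then optimizes a null-text embedding at every timestep to correct the drift this loop accumulates under classifier-free guidance, which is why it is slow; ResInversion instead rectifies the noise via a pre-diffusion mechanism, avoiding that per-step optimization.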
Maoye Ren
College of Computer Science and Technology, East China University of Science and Technology, Shanghai 200237, China
Praneetha Vaddamanu
Applied Scientist, Microsoft Turing
Jianjin Xu
CMU Robotics Institute
Fernando De la Torre Frade
College of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213