🤖 AI Summary
To address low facial resolution, identity (ID) confusion, and background distortion in multi-identity full-body portrait generation, this paper proposes FPGA, an end-to-end framework with a plug-and-play inference stage. Methodologically, it introduces (1) DDIM Inversion based ID Restoration (DIIR), an inference framework that repairs facial artifacts while preserving the background; (2) a Multi-Mode Fusion (MMF) training strategy that activates the specified ID within the specified facial region; and (3) IDZoom, a million-scale multimodal training dataset, together with RepControlNet, a reparameterization method for accelerated inference. Comparative and ablation experiments show that FPGA has significant advantages on both objective and subjective metrics and achieves controllable generation in multi-ID scenarios. On a single L20 GPU, inference completes within 2.5 seconds. DIIR can also perform face swapping and applies to stylized faces.
📝 Abstract
Portrait Fidelity Generation is a prominent research area in generative models. Current methods struggle to generate full-body images, in which faces are rendered at low resolution, especially in photos containing multiple identities (multi-ID). To tackle these issues, we propose a comprehensive system called FPGA and construct a million-level multi-modal dataset, IDZoom, for training. FPGA consists of a Multi-Mode Fusion (MMF) training strategy and a DDIM Inversion based ID Restoration (DIIR) inference framework. MMF aims to activate the specified ID in the specified facial region. DIIR aims to remove face artifacts while keeping the background unchanged. Furthermore, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method to enhance its performance. DIIR can also perform face-swapping tasks and is applicable to stylized faces. To validate the effectiveness of FPGA, we conducted extensive comparative and ablation experiments. The experimental results demonstrate that FPGA has significant advantages in both subjective and objective metrics and achieves controllable generation in multi-ID scenarios. In addition, we reduce inference time to within 2.5 seconds on a single L20 graphics card, mainly through our well-designed reparameterization method, RepControlNet.
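The abstract does not spell out how DIIR works, so the following is only a minimal sketch of generic deterministic DDIM inversion (eta = 0), the building block named in DIIR, and not the authors' implementation. The noise schedule `ALPHAS`, the 16-element "image", and `toy_eps_model` are all hypothetical stand-ins. The toy predictor returns a fixed noise vector, which makes the inversion exactly reversible; with a real trained network, inversion is only approximate.

```python
import math
import random

def ddim_step(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM update (eta = 0), applied element-wise.

    alpha_from / alpha_to are cumulative signal coefficients (alpha-bar).
    The same formula moves toward more noise (inversion) or less (sampling).
    """
    out = []
    for xi, ei in zip(x, eps):
        # Estimate the clean sample implied by the current point and noise.
        x0_pred = (xi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
        # Re-noise that estimate to the target noise level.
        out.append(math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * ei)
    return out

random.seed(0)
image = [random.gauss(0.0, 1.0) for _ in range(16)]        # hypothetical flattened image
FIXED_NOISE = [random.gauss(0.0, 1.0) for _ in range(16)]

def toy_eps_model(x, t):
    """Hypothetical stand-in for a trained noise predictor (ignores x and t)."""
    return FIXED_NOISE

# Assumed schedule: index 0 is nearly clean, the last index is the noisiest.
ALPHAS = [0.999 - 0.05 * i for i in range(19)]

def invert(x, alphas, eps_model):
    """Map an image to a noisy latent by running DDIM steps toward more noise."""
    for t in range(len(alphas) - 1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t + 1])
    return x

def sample(x, alphas, eps_model):
    """Map a latent back to an image by running DDIM steps toward less noise."""
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x

latent = invert(image, ALPHAS, toy_eps_model)
reconstruction = sample(latent, ALPHAS, toy_eps_model)
max_error = max(abs(r - i) for r, i in zip(reconstruction, image))
```

Because every step is deterministic, sampling from the inverted latent retraces the inversion path and returns to the input. This round-trip property is what makes DDIM inversion a common starting point for editing part of an existing image while leaving the rest close to the original.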