🤖 AI Summary
In Pose-Guided Person Image Synthesis (PGPIS), latent diffusion models (LDMs) lose fine-grained detail, particularly in facial and garment textures, because of aggressive latent-space compression. To address this, the paper proposes a Multi-focal Conditioned Latent Diffusion (MCLD) method that conditions the LDM on disentangled, pose-invariant identity and texture features from these sensitive regions. Its multi-focal condition aggregation module adaptively fuses facial-identity and texture-specific information within the LDM framework, improving structural fidelity at fine-grained levels. Experiments on DeepFashion show substantial gains: +12.7% in identity preservation (ID-Retrieval) and −9.3% in perceptual distortion (LPIPS↓), enabling high-fidelity, highly controllable portrait editing. The source code is publicly available.
📝 Abstract
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, LDM's compression process often deteriorates details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method that addresses these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach uses a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce appearance-realistic and identity-consistent images. Our method achieves consistent identity and appearance generation on the DeepFashion dataset and, owing to this consistency, enables flexible person image editing. The code is available at https://github.com/jqliu09/mcld.
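To make the idea of "aggregating multiple focal conditions" concrete, below is a minimal, dependency-free sketch of attention-style fusion of per-region condition features (e.g. a face-identity embedding and a garment-texture embedding) into a single conditioning vector. This is an illustration only: the actual MCLD module operates on learned image embeddings inside the diffusion network, and all function names and the query/weighting scheme here are assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aggregate_conditions(query, focal_feats):
    """Fuse per-region condition features (face, garment, ...) into one
    conditioning vector via scaled-dot-product attention weights.

    query:        a vector the regions are scored against (hypothetical;
                  in a real model this would be a learned query)
    focal_feats:  list of equal-length feature vectors, one per region
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, f) / scale for f in focal_feats]
    weights = softmax(scores)
    dim = len(focal_feats[0])
    fused = [sum(w * f[i] for w, f in zip(weights, focal_feats))
             for i in range(dim)]
    return fused, weights
```

For example, with a query aligned to the first region, `aggregate_conditions([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` weights the face feature more heavily than the garment feature while still blending both, which is the qualitative behavior a condition-aggregation module is meant to provide.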