🤖 AI Summary
This work addresses facial attribute editing, where it remains difficult to manipulate target attributes precisely, preserve attribute-irrelevant content, and maintain high image fidelity all at once. Existing GAN-based methods suffer from weak alignment between style codes and attribute semantics, while diffusion models often exhibit entangled semantic directions that hinder disentangled editing. To overcome these limitations, the authors propose AttDiff-GAN, a hybrid framework that integrates GANs and diffusion models: it decouples attribute editing from image generation by learning attribute manipulation through feature-level adversarial learning and then guiding the diffusion denoising process with the edited features. A PriorMapper module incorporates facial priors into style generation to improve style–attribute alignment, and a Transformer-based RefineExtractor models global semantic relationships for more precise style extraction. Experiments on CelebA-HQ show that the proposed method outperforms state-of-the-art approaches in both attribute editing accuracy and preservation of non-target attributes.
📝 Abstract
Facial attribute editing aims to modify target attributes while preserving attribute-irrelevant content and overall image fidelity. Existing GAN-based methods provide favorable controllability, but often suffer from weak alignment between style codes and attribute semantics. Diffusion-based methods can synthesize highly realistic images; however, their editing precision is limited by the entanglement of semantic directions among different attributes. In this paper, we propose AttDiff-GAN, a hybrid framework that combines GAN-based attribute manipulation with diffusion-based image generation. A key challenge in such integration lies in the inconsistency between one-step adversarial learning and multi-step diffusion denoising, which makes effective optimization difficult. To address this issue, we decouple attribute editing from image synthesis: a feature-level adversarial learning scheme learns explicit attribute manipulation, and the manipulated features then guide the diffusion process for image generation. This design also removes the reliance on semantic direction-based editing. Moreover, we enhance style-attribute alignment by introducing PriorMapper, which incorporates facial priors into style generation, and RefineExtractor, which captures global semantic relationships through a Transformer for more precise style extraction. Experimental results on CelebA-HQ show that the proposed method achieves more accurate facial attribute editing and better preservation of non-target attributes than state-of-the-art methods in both qualitative and quantitative evaluations.
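The decoupling described in the abstract can be pictured with a toy sketch: a one-step edit in feature space, followed by a multi-step generation loop that is guided by the edited features. This is only an illustration under strong assumptions, not the AttDiff-GAN implementation. The function names (`edit_features`, `guided_denoise`), the list-of-floats "features", and the pull-toward-guidance update are all hypothetical stand-ins; a real system would use learned networks for both stages and an adversarial loss on the edited features.

```python
def edit_features(features, style_shift, target_mask):
    """One-step edit: apply the style shift only where the mask marks a target attribute.

    Hypothetical stand-in for a learned feature-level editor.
    """
    return [f + s if m else f for f, s, m in zip(features, style_shift, target_mask)]


def guided_denoise(noisy, guidance, steps=20, rate=0.3):
    """Toy multi-step loop: each step moves the sample a fraction of the way to the guidance.

    Hypothetical stand-in for a diffusion denoiser conditioned on edited features.
    """
    x = list(noisy)
    for _ in range(steps):
        x = [xi + rate * (gi - xi) for xi, gi in zip(x, guidance)]
    return x


# Hypothetical values, chosen only for illustration.
features = [0.2, -1.0, 0.5, 0.8]            # features of the source face
style_shift = [1.5, 0.0, -0.7, 0.0]         # style for the requested edit
target_mask = [True, False, False, False]   # only the first attribute is a target

# Stage 1: the edit happens once, in feature space.
edited = edit_features(features, style_shift, target_mask)

# Stage 2: generation runs for many steps but only consumes the edited features.
generated = guided_denoise([0.0, 0.0, 0.0, 0.0], edited)
```

In this sketch the non-target entries of `edited` are identical to those of `features`, which mirrors the goal of preserving non-target attributes, and the editing step never has to be optimized through the multi-step loop.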