🤖 AI Summary
To address inconsistent cross-modal embeddings in CLIP—leading to frequent artifacts, poor attribute disentanglement, and unreliable control in text-guided face editing—this paper proposes Projection-Augmented Embeddings (PAE). PAE constructs a semantic subspace from relevant text prompts within CLIP’s joint embedding space, achieves feature disentanglement via PCA-based dimensionality reduction and orthogonal projection, and enhances target consistency through embedded feature fusion. Crucially, PAE is the first method to incorporate prompt subspace projection directly into the editing optimization objective, markedly improving interpretability, cross-attribute independence, and fine-grained controllability. Compatible with mainstream CLIP-driven frameworks, PAE achieves state-of-the-art performance on face editing tasks, yielding 12% improvements in both LPIPS and ID similarity metrics. Qualitative evaluation confirms superior disentanglement and semantic fidelity over baseline methods.
📝 Abstract
Recently introduced Contrastive Language-Image Pre-Training (CLIP) [Radford et al. 2021] bridges images and text by embedding them into a joint latent space. This opens the door to ample literature that aims to manipulate an input image by providing a textual explanation. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee for manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce CLIP projection-augmentation embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates a more disentangled, interpretable, and controllable face image manipulation with state-of-the-art quality and accuracy.