MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach

📅 2025-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-driven face editing methods struggle to simultaneously achieve diversity, controllability, and flexibility. To address this, we propose the first unified framework that integrates a diffusion-based Text-to-Mask model for generating fine-grained semantic masks, a semantics-aware GAN, and a multi-stage mask-guided editing mechanism—all underpinned by CLIP text embeddings for precise semantic alignment. Our architecture is the first to jointly optimize these three objectives within a single end-to-end pipeline, significantly improving spatial localization accuracy and semantic fidelity. Evaluated on multiple benchmarks, our method achieves state-of-the-art performance both quantitatively and qualitatively, outperforming prior approaches across diverse metrics. Notably, it enables high-fidelity face editing under complex natural language instructions, demonstrating unprecedented expressiveness and robustness in real-world scenarios.

Technology Category

Application Category

📝 Abstract
Face editing modifies the appearance of face, which plays a key role in customization and enhancement of personal images. Although much work have achieved remarkable success in text-driven face editing, they still face significant challenges as none of them simultaneously fulfill the characteristics of diversity, controllability and flexibility. To address this challenge, we propose MuseFace, a text-driven face editing framework, which relies solely on text prompt to enable face editing. Specifically, MuseFace integrates a Text-to-Mask diffusion model and a semantic-aware face editing model, capable of directly generating fine-grained semantic masks from text and performing face editing. The Text-to-Mask diffusion model provides extit{diversity} and extit{flexibility} to the framework, while the semantic-aware face editing model ensures extit{controllability} of the framework. Our framework can create fine-grained semantic masks, making precise face editing possible, and significantly enhancing the controllability and flexibility of face editing models. Extensive experiments demonstrate that MuseFace achieves superior high-fidelity performance.
Problem

Research questions and friction points this paper is trying to address.

Text-driven face editing lacks diversity, controllability, flexibility
Generating fine-grained semantic masks from text for precise editing
Enhancing face editing model controllability and flexibility via diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-Mask diffusion model for semantic masks
Semantic-aware face editing for controllability
Fine-grained masks enable precise face editing
🔎 Similar Papers
No similar papers found.
X
Xin Zhang
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Xi’an Jiaotong University, Xi’an, China
S
Siting Huang
Xi’an Jiaotong University, Xi’an, China
X
Xiangyang Luo
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
Y
Yifan Xie
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Xi’an Jiaotong University, Xi’an, China
Weijiang Yu
Weijiang Yu
Associate Professor, CSE, Sun Yat-sen University
Machine LearningMultimodal AIAI for Science
Heng Chang
Heng Chang
Tsinghua University
Trustworthy AIGraph Representation LearningData Mining
F
Fei Ma
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
F
Fei Yu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China