🤖 AI Summary
Existing makeup generation methods rely on multi-model pipelines and cross-domain feature alignment, and lack text-driven, zero-shot virtual try-on capabilities. This paper proposes the first unified single-model framework that jointly supports multiple image editing tasks—including beauty filtering, makeup transfer, and makeup removal—as well as text-guided makeup synthesis, all in an end-to-end manner. Key contributions include: (1) a cross-domain diffusion architecture with switchable domain embeddings for flexible task control; (2) MT-Text, the first makeup dataset with fine-grained aligned text-image annotations, enabling robust text-image alignment training; and (3) data augmentation and joint optimization strategies achieving state-of-the-art performance across all tasks. The approach significantly reduces deployment complexity and, for the first time, enables zero-shot, text-controllable, and multi-task-compatible makeup editing.
📝 Abstract
Existing makeup techniques often require designing multiple models to handle different inputs and to align features across domains for different makeup tasks, e.g., beauty filtering, makeup transfer, and makeup removal, leading to increased complexity. Another limitation is the absence of text-guided makeup try-on, which is more user-friendly because it does not require reference images. In this study, we make the first attempt to use a single model for various makeup tasks. Specifically, we formulate different makeup tasks as cross-domain translations and leverage a cross-domain diffusion model to accomplish all of them. Unlike existing methods that rely on separate encoder-decoder configurations or cycle-based mechanisms, we propose using different domain embeddings to facilitate domain control. This allows seamless domain switching by merely changing the embedding within a single model, reducing the reliance on additional modules for different tasks. Moreover, to support precise text-to-makeup applications, we introduce the MT-Text dataset by extending the MT dataset with textual annotations, advancing the practicality of makeup technologies.
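The domain-embedding idea in the abstract can be illustrated with a minimal sketch: a single denoiser receives a conditioning vector built from the diffusion timestep plus a per-task embedding, so switching tasks only swaps the embedding. All names here (the domain list, `embed_dim`, the helper functions) are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 8  # illustrative size; real models use much larger embeddings

# One learned embedding per makeup task (domain); hypothetical task names.
DOMAINS = ["beauty_filter", "makeup_transfer", "makeup_removal", "text_to_makeup"]
domain_table = {d: rng.normal(size=embed_dim) for d in DOMAINS}

def timestep_embedding(t, dim=embed_dim):
    """Standard sinusoidal timestep embedding used by diffusion models."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

def conditioning_vector(t, domain):
    """The shared denoiser consumes timestep + domain embedding; changing
    tasks means changing only `domain`, not the model weights."""
    return timestep_embedding(t) + domain_table[domain]

c_transfer = conditioning_vector(t=100, domain="makeup_transfer")
c_removal = conditioning_vector(t=100, domain="makeup_removal")
assert c_transfer.shape == (embed_dim,)
assert not np.allclose(c_transfer, c_removal)  # same model, different task signal
```

In practice such a conditioning vector would be injected into the diffusion U-Net (e.g., added to timestep embeddings or via cross-attention); the sketch only shows why a single model suffices for all tasks.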