Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based makeup transfer methods struggle to accurately model makeup styles and lack fine-grained control over critical facial regions such as the eyes and lips. To address these limitations, this work proposes Facial Region-Aware Makeup features (FRAM), which fine-tunes CLIP via self-supervised and image-text contrastive learning to build a makeup-specific encoder. A learnable token query mechanism, trained with an attention-based loss, extracts region-level makeup features for fine-grained control, while a ControlNet Union encodes the source image together with its 3D facial mesh to preserve identity. The proposed method improves both regional controllability and overall visual quality of makeup transfer while preserving identity characteristics, outperforming state-of-the-art approaches in comprehensive experiments.
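
The paper does not release code, so as a reading aid, here is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective that "image-text contrastive learning" conventionally denotes in CLIP fine-tuning; all tensor and function names are illustrative, not from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/caption embeddings.

    image_emb, text_emb: (B, D) projections from the image and text towers.
    Pairs sharing a batch index are positives; all others are negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

In FRAM this objective would pair makeup images with the synthesized style captions from stage 1, pushing the encoder toward makeup-specific rather than generic features.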

📝 Abstract
Current diffusion-based makeup transfer methods commonly use makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as a condition to preserve the makeup style of the reference image during generation. Although effective, these methods have two main limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of the reference image are injected into the diffusion denoising model as a whole for global makeup transfer, overlooking facial region-aware makeup features (e.g., eyes, mouth) and limiting regional controllability for region-specific makeup transfer. To address these issues, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning and (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works that use off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and a text-driven image editing model, and then use this data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the images edited in stage 1 and use them to learn to inject the identity of the source image and the makeup of the reference image into the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder and extract facial region-aware makeup features for makeup injection; the tokens are learned via an attention loss to enable regional control. For identity injection, we use a ControlNet Union to encode the source image and its 3D mesh simultaneously. Experimental results verify the superiority of our method in both regional controllability and makeup transfer performance.
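
The abstract describes the learnable-token query only at a high level. Below is a minimal PyTorch sketch of what such a mechanism typically looks like: one learnable token per facial region cross-attends to the makeup encoder's patch features, and the resulting attention maps are the natural target for the attention loss mentioned in the abstract. Module, argument, and dimension names here are assumptions for illustration, not the paper's API.

```python
import torch
import torch.nn as nn

class RegionTokenQuery(nn.Module):
    """Sketch: learnable region tokens querying a makeup encoder's patch features.

    One token per facial region (e.g., eyes, lips, skin) cross-attends to the
    patch-level features of the makeup CLIP encoder; each output token serves
    as a region-specific makeup condition for the diffusion denoising model.
    """
    def __init__(self, dim: int = 768, num_regions: int = 3, num_heads: int = 8):
        super().__init__()
        self.region_tokens = nn.Parameter(torch.randn(num_regions, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor):
        # patch_feats: (B, N, dim) patch features from the makeup CLIP encoder
        B = patch_feats.size(0)
        queries = self.region_tokens.unsqueeze(0).expand(B, -1, -1)  # (B, R, dim)
        out, attn = self.cross_attn(queries, patch_feats, patch_feats)
        # `attn` has shape (B, R, N): one attention map per region token. An
        # attention loss of the kind the paper describes could align each map
        # with the corresponding facial-region mask to enforce regional control.
        return self.norm(out), attn  # (B, R, dim) region-aware makeup features
```

Under this reading, region-specific transfer amounts to swapping or masking individual output tokens before injection, which is what gives the method its per-region controllability.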
Problem

Research questions and friction points this paper is trying to address.

makeup transfer
diffusion models
facial region-aware
foundation models
regional controllability
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-based makeup transfer
facial region-aware features
makeup CLIP fine-tuning
region-specific control
ControlNet Union