MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation

📅 2026-01-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current facial action unit (AU) recognition systems are hindered by the scarcity of large-scale, demographically diverse facial image datasets with precise annotations for both AU occurrence and intensity. To address this limitation, this work proposes MAUGen, a multimodal generative framework based on diffusion models that, for the first time, jointly generates multi-identity facial expression images and anatomically consistent, fine-grained AU labels (both occurrence and intensity) from textual descriptions. By integrating a multimodal representation learning (MRL) module with a diffusion-based image-label generator (DIG), the framework aligns text, identity, image, and AU signals within a unified latent space; it is also used to construct MIFA, a large-scale synthetic dataset. MAUGen surpasses existing methods in the photorealism, demographic diversity, and AU semantic alignment of its generated images, which in turn substantially improves the training efficacy and generalization of AU recognition models.

📝 Abstract

The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. Our MAUGen involves two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.
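
To make the idea of paired AU occurrence and intensity labels concrete, the following is a minimal sketch of how one generated image-label pair might be represented. The class name `AUSample`, its fields, the example file path, the 0-5 ordinal intensity scale, and the rule that an AU "occurs" when its intensity is above zero are all illustrative assumptions; the abstract does not specify MAUGen's or MIFA's actual label format.

```python
from dataclasses import dataclass, field
from typing import Dict

# Assumed 0-5 ordinal intensity scale; the abstract does not state the scale used.
MAX_INTENSITY = 5


@dataclass
class AUSample:
    """Hypothetical container for one generated image-label pair."""

    prompt: str        # text description that conditioned generation
    identity_id: str   # which synthetic identity the image depicts
    image_path: str    # location of the generated image (illustrative)
    au_intensity: Dict[int, int] = field(default_factory=dict)  # AU number -> intensity

    def __post_init__(self) -> None:
        # Reject intensities outside the assumed scale.
        for au, level in self.au_intensity.items():
            if not 0 <= level <= MAX_INTENSITY:
                raise ValueError(
                    f"AU{au} intensity {level} outside 0..{MAX_INTENSITY}"
                )

    def occurrence(self) -> Dict[int, bool]:
        # Assumption: an AU counts as present whenever its intensity is nonzero.
        return {au: level > 0 for au, level in self.au_intensity.items()}


# Example: AU6 (cheek raiser) and AU12 (lip corner puller) active, AU4 (brow lowerer) inactive.
sample = AUSample(
    prompt="a person smiling broadly",
    identity_id="id_0001",
    image_path="id_0001/smile.png",
    au_intensity={6: 2, 12: 4, 4: 0},
)
print(sample.occurrence())  # {6: True, 12: True, 4: False}
```

Deriving occurrence from intensity keeps the two label types consistent by construction, which is one simple way to read the abstract's requirement that both annotations agree for each image.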

Problem

Research questions and friction points this paper is trying to address.

Action Unit

facial expression

data scarcity

demographic diversity

AU annotation

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model

action unit generation

multi-identity facial expression

multimodal representation learning

synthetic dataset

🔎 Similar Papers

DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment

2024-03-25 · arXiv.org · Citations: 6

CFCPalsy: Facial Image Synthesis with Cross-Fusion Cycle Diffusion Model for Facial Paralysis Individuals

2024-09-11 · Citations: 1