MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current facial action unit (AU) recognition systems are hindered by the scarcity of large-scale, demographically diverse facial image datasets with precise annotations for both AU occurrence and intensity. To address this limitation, this work proposes MAUGen, a diffusion-based multimodal generative framework that, for the first time, jointly generates multi-identity facial expression images and anatomically consistent, fine-grained AU labels (covering both occurrence and intensity) from textual descriptions. By coupling multimodal representation learning (MRL) with a diffusion-based image-label generator (DIG), the framework aligns text, identity, image, and AU signals within a unified latent space, and it is used to construct MIFA, a large-scale synthetic dataset. The generated images surpass existing methods in photorealism, demographic diversity, and semantic alignment with AU labels, substantially improving the training efficacy and generalization of AU recognition models.

📝 Abstract
The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. Our MAUGen involves two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image-Label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.
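To make the two-stage design in the abstract concrete, the sketch below shows one way such a pipeline could be wired up: an MRL-style encoder that projects text, identity, and AU signals into a shared conditioning latent, and a DIG-style denoiser trained to recover a concatenated image-latent-plus-AU-label target. This is a minimal illustration only; all module names, dimensions, the linear-projection fusion, the MLP denoiser, and the toy noise schedule are assumptions for exposition and do not reflect the paper's actual architecture.

```python
# Hypothetical sketch of a joint image/AU-label diffusion pipeline (not MAUGen's real code).
import torch
import torch.nn as nn


class MultiModalEncoder(nn.Module):
    """Aligns text, identity, and AU embeddings in one latent space (MRL-like, assumed)."""

    def __init__(self, text_dim=512, id_dim=128, au_dim=24, latent_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.id_proj = nn.Linear(id_dim, latent_dim)
        self.au_proj = nn.Linear(au_dim, latent_dim)  # AU occurrence/intensity vector
        self.fuse = nn.Sequential(nn.Linear(3 * latent_dim, latent_dim), nn.GELU())

    def forward(self, text_emb, id_emb, au_vec):
        z = torch.cat([self.text_proj(text_emb),
                       self.id_proj(id_emb),
                       self.au_proj(au_vec)], dim=-1)
        return self.fuse(z)  # unified conditioning vector


class JointDenoiser(nn.Module):
    """Predicts noise for a concatenated image-latent + AU-label target (DIG-like, assumed)."""

    def __init__(self, img_latent_dim=1024, au_dim=24, cond_dim=256):
        super().__init__()
        joint = img_latent_dim + au_dim
        self.net = nn.Sequential(
            nn.Linear(joint + cond_dim + 1, 1024), nn.SiLU(),
            nn.Linear(1024, joint))

    def forward(self, noisy_joint, t, cond):
        t = t.float().unsqueeze(-1) / 1000.0  # crude scalar timestep embedding
        return self.net(torch.cat([noisy_joint, cond, t], dim=-1))


# One illustrative training step: noise the joint (image latent, AU label) target
# and regress the noise back, conditioned on the fused multimodal latent.
enc, den = MultiModalEncoder(), JointDenoiser()
B = 4
text_emb, id_emb, au_vec = torch.randn(B, 512), torch.randn(B, 128), torch.rand(B, 24)
img_latent = torch.randn(B, 1024)                 # stands in for a VAE image latent
joint = torch.cat([img_latent, au_vec], dim=-1)   # image and label generated jointly
t = torch.randint(0, 1000, (B,))
noise = torch.randn_like(joint)
alpha = (1.0 - t.float() / 1000.0).unsqueeze(-1)  # toy linear noise schedule
noisy = alpha.sqrt() * joint + (1 - alpha).sqrt() * noise
pred = den(noisy, t, enc(text_emb, id_emb, au_vec))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
```

At sampling time the same denoiser would be run in reverse from pure noise, so that each generated image latent arrives paired with a consistent AU occurrence/intensity vector, which is the property that lets a dataset like MIFA be built without manual annotation.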
Problem

Research questions and friction points this paper is trying to address.

Action Unit
facial expression
data scarcity
demographic diversity
AU annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model
action unit generation
multi-identity facial expression
multimodal representation learning
synthetic dataset
Xiangdong Li
School of Software Technology, Zhejiang University
Ye Lou
School of Software Technology, Zhejiang University
Ao Gao
School of Software Technology, Zhejiang University
Wei Zhang
Zhejiang University
digital humanities, data visualization
Siyang Song
Lecturer (AP), University of Exeter
Social Signal Processing, Affective Computing, Machine Learning, Human-Computer Interaction