🤖 AI Summary
Current facial action unit (AU) recognition systems are hindered by the scarcity of large-scale, demographically diverse facial image datasets with precise annotations for both AU occurrence and intensity. To address this limitation, this work proposes MAUGen, a multimodal generative framework based on diffusion models that, for the first time, jointly generates multi-identity facial expression images and anatomically consistent, fine-grained AU labels (both occurrence and intensity) from textual descriptions. By integrating multimodal representation learning (MRL) with a diffusion-based image-label generator (DIG), the framework aligns text, identity, image, and AU signals within a unified latent space. It is also used to construct MIFA, a large-scale synthetic dataset. The generated images surpass those of existing methods in photorealism, demographic diversity, and semantic alignment with AUs, and substantially improve the training efficacy and generalization of AU recognition models.
📝 Abstract
The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multimodal framework that jointly generates a large collection of photorealistic facial expression images and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. MAUGen comprises two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Using this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.