🤖 AI Summary
Facial expression datasets suffer from limited scale due to subjective labeling, high acquisition costs, and privacy constraints, which hinders deep learning models, especially foundation models. To address this, we propose SynFER, a novel synthesis framework that integrates textual semantic descriptions with fine-grained facial Action Unit (AU) control. It combines semantic-guided generation with a pseudo-label generator that rectifies expression labels, improving both synthetic image fidelity and label reliability. Built upon diffusion models, SynFER enables high-fidelity, controllable facial expression synthesis. Trained solely on synthetic data equal in size to the AffectNet training set, a classifier reaches 67.23% accuracy on AffectNet; scaling the synthetic data to five times that size yields 69.84%, approaching the performance attained with real-data training. SynFER establishes a scalable, interpretable, and privacy-preserving paradigm for synthetic data generation in low-resource facial expression analysis.
📝 Abstract
Facial expression datasets remain limited in scale due to privacy concerns, the subjectivity of annotations, and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, which rely on large-scale data for optimal performance. To address this challenge, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel framework for synthesizing facial expression image data from high-level textual descriptions, with finer-grained and more precise control through facial action units. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique to steer the generation process and a pseudo-label generator to help rectify the facial expression labels of the synthetic images. To demonstrate the generation fidelity and the effectiveness of the synthetic data from SynFER, we conduct extensive experiments on representation learning using both synthetic and real-world data. Experimental results validate the efficacy of the proposed approach and the synthetic data. Notably, our approach achieves 67.23% classification accuracy on AffectNet when training solely on synthetic data equal in size to the AffectNet training set, which increases to 69.84% when scaling up to five times the original size. Our code will be made publicly available.
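
The abstract does not say how the pseudo-label generator rectifies labels, so the following is only a minimal sketch of one plausible rule, not SynFER's actual procedure. It assumes a separate expression classifier has already produced per-class probabilities for each synthetic image, and it replaces the label implied by the generation prompt only when that classifier disagrees with high confidence. The function name `rectify_labels`, the `threshold` parameter, and its default value of 0.8 are all hypothetical.

```python
from typing import List, Sequence


def rectify_labels(
    prompt_labels: Sequence[int],
    probs: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical confidence cut-off, not from the paper
) -> List[int]:
    """Return one label per synthetic image.

    prompt_labels: class index implied by the text prompt used for generation.
    probs: per-class probabilities from an (assumed) expression classifier.
    A prompt label is overridden only when the classifier predicts a
    different class with probability >= threshold.
    """
    rectified: List[int] = []
    for label, p in zip(prompt_labels, probs):
        # Index of the most probable class according to the classifier.
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != label and p[predicted] >= threshold:
            rectified.append(predicted)  # confident disagreement: relabel
        else:
            rectified.append(label)  # agreement or low confidence: keep
    return rectified
```

Under this assumed rule, an image generated for class 0 but classified as class 1 with probability 0.9 would be relabeled, while a low-confidence disagreement would leave the prompt label untouched.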