🤖 AI Summary
This study systematically investigates sources of bias and fairness challenges in facial expression recognition (FER). Addressing the problem of demographic disparities in FER performance, we conduct a comprehensive empirical analysis across four benchmark datasets—AffectNet, ExpW, Fer2013, and RAF-DB—and six representative models: MobileNet, ResNet, Xception, ViT, CLIP, and GPT-4o-mini. Our method employs fine-grained demographic annotations, cross-dataset generalization evaluation, and quantitative fairness metrics—including Equalized Odds and Demographic Parity. Contrary to common assumptions, we reveal for the first time that high-accuracy Transformer-based models (ViT and GPT-4o-mini) exhibit significantly greater group-level bias than lightweight CNNs. We further demonstrate that data imbalance and model architecture jointly exacerbate fairness degradation, uncovering a strong accuracy–fairness trade-off. To support reproducibility and future research, we release an open-source, fully reproducible experimental framework—providing both theoretical insights and practical guidelines for developing fairer FER systems.
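The group-level fairness metrics named above (Demographic Parity and Equalized Odds) can be sketched for a binary one-vs-rest expression label as follows. This is a minimal illustration of the standard definitions, not code from the released framework; the function names and the toy arrays are hypothetical.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    # Largest difference in positive-prediction rate between any two groups.
    rates = [np.mean(y_pred[group == g]) for g in np.unique(group)]
    return max(rates) - min(rates)

def equalized_odds_gap(y_true, y_pred, group):
    # Worst gap across groups in true-positive rate or false-positive rate.
    tprs, fprs = [], []
    for g in np.unique(group):
        yt, yp = y_true[group == g], y_pred[group == g]
        tprs.append(np.mean(yp[yt == 1]))  # TPR for this group
        fprs.append(np.mean(yp[yt == 0]))  # FPR for this group
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Toy example: 8 samples, two demographic groups (0 and 1).
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(demographic_parity_gap(y_pred, group))          # → 0.25
print(equalized_odds_gap(y_true, y_pred, group))      # → 0.5
```

A model can score well on one metric and poorly on the other: in the toy data both groups have the same TPR (0.5), yet the positive-prediction rates and FPRs differ, which is why the study reports both metrics side by side.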
📝 Abstract
Building AI systems, including Facial Expression Recognition (FER) systems, involves two critical components: data and model design. Both significantly influence bias and fairness in FER tasks, yet bias and fairness issues in FER datasets and models remain underexplored. This study investigates sources of bias in FER datasets and models. Four common FER datasets (AffectNet, ExpW, Fer2013, and RAF-DB) are analyzed. The findings demonstrate that AffectNet and ExpW exhibit high generalizability despite data imbalances. Additionally, this research evaluates the bias and fairness of six deep models: three state-of-the-art convolutional neural network (CNN) models (MobileNet, ResNet, and Xception) and three Transformer-based models (ViT, CLIP, and GPT-4o-mini). Experimental results reveal that while GPT-4o-mini and ViT achieve the highest accuracy scores, they also display the highest levels of bias. These findings underscore the urgent need for new methodologies to mitigate bias and ensure fairness in datasets and models, particularly in affective computing applications. Implementation details are available at https://github.com/MMHosseini/bias_in_FER.