FairViT-GAN: A Hybrid Vision Transformer with Adversarial Debiasing for Fair and Explainable Facial Beauty Prediction

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Facial Beauty Prediction (FBP) faces three key challenges: (1) CNNs struggle to capture holistic facial harmony, while ViTs often overlook fine-grained textural details; (2) models inherit and amplify societal biases—e.g., racial bias—during training; and (3) decision-making lacks interpretability. To address these, we propose a dual-branch CNN-ViT architecture with cross-branch attention, integrated with adversarial debiasing to enforce invariance of protected attributes (e.g., race) in learned representations. Feature disentanglement and attention-based visualization further enhance model transparency. Evaluated on SCUT-FBP5500, our method achieves state-of-the-art performance with a Pearson correlation coefficient of 0.9230 and RMSE of 0.2650. Crucially, inter-group performance disparity is reduced by 82.9%, and the adversarial classifier’s accuracy drops to 52.1%—near chance level—demonstrating substantial improvements in fairness and interpretability.
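The adversarial debiasing described above is commonly realized with a gradient reversal layer: the feature extractor is trained so that an adversary cannot recover the protected attribute from its representations. The paper's exact architecture is not reproduced here; the following is a minimal PyTorch sketch under that assumption, with all class names, dimensions, and the `lam` scaling factor being illustrative.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward,
    so the feature extractor is pushed to *hurt* the adversary."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class AdversarialDebiaser(nn.Module):
    """Two heads on shared features: a beauty-score regressor and an adversary
    that tries to classify the protected attribute through a reversal layer."""

    def __init__(self, feat_dim=16, n_groups=2, lam=1.0):
        super().__init__()
        self.lam = lam
        self.regressor = nn.Linear(feat_dim, 1)        # beauty score head
        self.adversary = nn.Linear(feat_dim, n_groups)  # protected-attribute head

    def forward(self, feats):
        score = self.regressor(feats)
        adv_logits = self.adversary(GradReverse.apply(feats, self.lam))
        return score, adv_logits
```

Training would then minimize the regression loss plus the adversary's cross-entropy; because of the reversed gradients, the shared features drift toward attribute invariance, which is consistent with the adversary's accuracy dropping toward chance (52.1%) reported above.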

📝 Abstract
Facial Beauty Prediction (FBP) has made significant strides with the application of deep learning, yet state-of-the-art models often exhibit critical limitations, including architectural constraints, inherent demographic biases, and a lack of transparency. Existing methods, primarily based on Convolutional Neural Networks (CNNs), excel at capturing local texture but struggle with global facial harmony, while Vision Transformers (ViTs) effectively model long-range dependencies but can miss fine-grained details. Furthermore, models trained on benchmark datasets can inadvertently learn and perpetuate societal biases related to protected attributes like ethnicity. To address these interconnected challenges, we propose FairViT-GAN, a novel hybrid framework that synergistically integrates a CNN branch for local feature extraction and a ViT branch for global context modeling. More significantly, we introduce an adversarial debiasing mechanism where the feature extractor is explicitly trained to produce representations that are invariant to protected attributes, thereby actively mitigating algorithmic bias. Our framework's transparency is enhanced by visualizing the distinct focus of each architectural branch. Extensive experiments on the SCUT-FBP5500 benchmark demonstrate that FairViT-GAN not only sets a new state-of-the-art in predictive accuracy, achieving a Pearson Correlation of 0.9230 and reducing RMSE to 0.2650, but also excels in fairness. Our analysis reveals a remarkable 82.9% reduction in the performance gap between ethnic subgroups, with the adversary's classification accuracy dropping to near-random chance (52.1%). We believe FairViT-GAN provides a robust, transparent, and significantly fairer blueprint for developing responsible AI systems for subjective visual assessment.
Problem

Research questions and friction points this paper is trying to address.

Addresses demographic bias in facial beauty prediction models
Combines CNN and ViT for local and global facial features
Enhances model transparency and reduces algorithmic bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid CNN-ViT architecture for local and global features
Adversarial debiasing mechanism for protected attribute invariance
Visualization techniques for enhanced framework transparency
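The first innovation, fusing local CNN features with global ViT context via cross-branch attention, can be sketched as one branch's tokens attending to the other's. The paper's exact fusion design is not specified here; this is a hypothetical PyTorch illustration in which flattened CNN feature-map tokens act as queries over ViT patch embeddings (all dimensions and names are assumptions).

```python
import torch
from torch import nn


class CrossBranchAttention(nn.Module):
    """Fuses the two branches: CNN tokens (queries) attend to ViT tokens
    (keys/values); the result is added back residually and normalized."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cnn_tokens, vit_tokens):
        # cnn_tokens: (B, N_c, dim) -- flattened conv feature map
        # vit_tokens: (B, N_v, dim) -- patch embeddings from the ViT branch
        fused, _ = self.attn(cnn_tokens, vit_tokens, vit_tokens)
        return self.norm(cnn_tokens + fused)
```

A symmetric module with the roles swapped (ViT tokens querying CNN tokens) would let each branch enrich the other before the shared representation feeds the regression and adversarial heads.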