Can Vision Transformers with ResNet's Global Features Fairly Authenticate Demographic Faces?

📅 2025-06-03
🏛️ International Conference on Pattern Recognition
📈 Citations: 0
Influential: 0
🤖 AI Summary
Face authentication suffers from insufficient fairness and generalization across demographic groups (e.g., race, gender, age). To address this, we propose the first few-shot prototypical network framework explicitly designed for fair face authentication. Our method integrates pre-trained global features from Swin Transformer, ViT-L, ViT-H, and ResNet-18, augmented with a lightweight two-layer fully connected module for local feature modeling. We further construct a multi-dimensional demographic support/query dataset to enable rigorous fairness evaluation. Experiments demonstrate that Swin Transformer consistently outperforms other backbones under one-/three-/five-shot settings, achieving accuracy gains of 3.2–5.7% across most demographic subgroups and yielding more balanced cross-group performance. To foster reproducibility and community advancement, we publicly release both the source code and the benchmark dataset.
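The fusion step described above (concatenating a pre-trained ViT's global features with ResNet-18 features, then passing them through a lightweight two-layer fully connected module) can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the feature dimensions (1024 for the ViT, 512 for ResNet-18) and the hidden/output widths of the FC head are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not specified in the summary):
# a ViT backbone emitting a 1024-d global embedding, ResNet-18 a 512-d one.
vit_feat = rng.standard_normal(1024)
resnet_feat = rng.standard_normal(512)

# Concatenate the two global feature vectors into one fused descriptor.
fused = np.concatenate([vit_feat, resnet_feat])   # shape (1536,)

def relu(x):
    return np.maximum(x, 0.0)

# Lightweight two-layer fully connected head (illustrative sizes);
# in the paper this module is trained to capture local features.
W1 = rng.standard_normal((1536, 256)) * 0.02
b1 = np.zeros(256)
W2 = rng.standard_normal((256, 128)) * 0.02
b2 = np.zeros(128)

embedding = relu(fused @ W1 + b1) @ W2 + b2       # final 128-d embedding
print(embedding.shape)  # (128,)
```

In a real pipeline the backbone features would come from frozen pre-trained models and the FC head would be trained end to end; here the weights are random purely to show the data flow.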

📝 Abstract
Biometric face authentication is crucial in computer vision, but ensuring fairness and generalization across demographic groups remains a significant challenge. Therefore, we investigated whether Vision Transformer (ViT) and ResNet, leveraging pre-trained global features, can fairly authenticate faces across demographic groups while relying minimally on local features. In this investigation, we used three pre-trained state-of-the-art (SOTA) ViT foundation models from Facebook, Google, and Microsoft for global features, as well as ResNet-18. We concatenated the features from ViT and ResNet, passed them through two fully connected layers, and trained on customized face image datasets to capture the local features. We then designed a novel few-shot prototype network with backbone feature embeddings. We also developed new demographic face image support and query datasets for this empirical study. We tested the network on this dataset in one-shot, three-shot, and five-shot scenarios to assess how performance improves as the size of the support set increases. We observed results across datasets with varying races/ethnicities, genders, and age groups. The Microsoft Swin Transformer backbone performed best among the three SOTA ViTs for this task. The code and data are available at: https://github.com/Sufianlab/FairVitBio.
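The few-shot prototype network evaluation mentioned in the abstract follows the standard prototypical-network recipe: each identity's support embeddings are averaged into a prototype, and a query is assigned to the nearest prototype. The sketch below illustrates this decision rule in numpy under assumed sizes (3-way, 5-shot, 128-d embeddings); it is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_way, k_shot, dim = 3, 5, 128  # assumed: 3 identities, 5-shot support, 128-d embeddings

# Simulated support embeddings: (n_way, k_shot, dim), one row of shots per identity.
support = rng.standard_normal((n_way, k_shot, dim))

# Class prototype = mean of that class's support embeddings.
prototypes = support.mean(axis=1)                 # (n_way, dim)

# A query embedding lying very close to prototype 0 should be assigned to class 0.
query = prototypes[0] + 0.01 * rng.standard_normal(dim)

# Nearest-prototype rule using squared Euclidean distance.
dists = ((prototypes - query) ** 2).sum(axis=1)   # (n_way,)
pred = int(np.argmin(dists))
print(pred)  # 0
```

Growing `k_shot` from 1 to 5 averages more support samples into each prototype, which is why the paper evaluates one-/three-/five-shot settings: larger support sets give lower-variance prototypes and typically higher accuracy.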
Problem

Research questions and friction points this paper is trying to address.

Ensuring fairness in face authentication across demographic groups
Combining Vision Transformers and ResNet for global feature authentication
Evaluating performance with few-shot learning on diverse demographic datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combined ViT and ResNet features for authentication
Novel few-shot prototype network with embedding
Tested on diverse demographic datasets
Abu Sufian
National Research Council of Italy - Institute of Applied Sciences and Intelligent Systems (CNR-ISASI), University of Gour Banga
Marco Leo
National Research Council of Italy - Institute of Applied Sciences and Intelligent Systems (ISASI)
Computer Vision · Machine Learning · Artificial Intelligence · Image and Signal Processing
C. Distante
National Research Council of Italy - Institute of Applied Sciences and Intelligent Systems (CNR-ISASI)
Anirudha Ghosh
Visva-Bharati, Santiniketan
Debaditya Barman
Visva-Bharati, Santiniketan