DemoBias: An Empirical Study to Trace Demographic Biases in Vision Foundation Models

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study empirically investigates demographic bias across race/ethnicity, gender, and age in large vision-language models (LVLMs) for biometric face recognition with descriptive text generation. The authors introduce a newly constructed, demographically balanced dataset and systematically evaluate three prominent LVLMs: LLaVA, BLIP-2, and PaliGemma. The method combines group-specific BERTScores with a Fairness Discrepancy Rate (FDR) to quantify performance disparities across demographic subgroups. Results reveal demographic bias across all models: PaliGemma and LLaVA exhibit the largest disparities for Hispanic/Latino, Caucasian, and South Asian groups, whereas BLIP-2 remains comparatively consistent across groups. To the authors' knowledge, this is the first work to conduct a fine-grained, unified fairness evaluation across multiple LVLMs under a consistent experimental framework. The study establishes a reproducible benchmark for diagnosing bias in vision-language understanding and provides actionable insights for developing more equitable multimodal AI systems.

📝 Abstract
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities across various downstream tasks, including biometric face recognition (FR) with description. However, demographic biases remain a critical concern in FR, as these foundation models often fail to perform equitably across diverse demographic groups spanning ethnicity/race, gender, and age. Therefore, through our work DemoBias, we conduct an empirical evaluation to investigate the extent of demographic biases in LVLMs for biometric FR with textual token generation tasks. We fine-tuned and evaluated three widely used pre-trained LVLMs, LLaVA, BLIP-2, and PaliGemma, on our own generated demographic-balanced dataset. We utilize several evaluation metrics, such as group-specific BERTScores and the Fairness Discrepancy Rate, to quantify and trace the performance disparities. The experimental results deliver compelling insights into the fairness and reliability of LVLMs across diverse demographic groups. Our empirical study uncovered demographic biases in LVLMs, with PaliGemma and LLaVA exhibiting higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, whereas BLIP-2 demonstrated comparably consistent performance. Repository: https://github.com/Sufianlab/DemoBias.
Problem

Research questions and friction points this paper is trying to address.

Investigating demographic biases in vision foundation models
Evaluating fairness across ethnicity, gender, and age groups
Measuring performance disparities in biometric face recognition tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned and evaluated LLaVA, BLIP-2, and PaliGemma
Constructed a demographically balanced dataset for evaluation
Quantified disparities with group-specific BERTScores and the Fairness Discrepancy Rate (FDR)
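The group-wise evaluation described above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: it assumes per-sample BERTScore F1 values have already been computed (e.g. with the `bert-score` package), and it treats the FDR as the gap between the best- and worst-performing demographic groups, which is one plausible formulation since the paper's exact definition is not given here.

```python
# Hypothetical sketch of group-specific scoring and a Fairness
# Discrepancy Rate (FDR). The FDR formulation here (max group mean
# minus min group mean) is an assumption for illustration.
from collections import defaultdict

def group_means(scores, groups):
    """Average a per-sample metric (e.g. BERTScore F1) within each demographic group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, g in zip(scores, groups):
        sums[g] += s
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}

def fairness_discrepancy_rate(scores, groups):
    """Gap between the best- and worst-scoring groups; 0 means perfect parity."""
    means = group_means(scores, groups)
    return max(means.values()) - min(means.values())

# Toy example: per-sample BERTScore F1 values with demographic labels.
f1 = [0.91, 0.88, 0.79, 0.82, 0.90, 0.77]
race = ["A", "A", "B", "B", "C", "C"]
print(group_means(f1, race))
print(fairness_discrepancy_rate(f1, race))
```

A model that is accurate on average can still show a large FDR if its errors concentrate in one subgroup, which is exactly the disparity pattern the study reports for PaliGemma and LLaVA.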