Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing gender bias evaluation benchmarks for vision-language models (VLMs) suffer from a serious methodological flaw: gender labels are often spuriously correlated with non-gender attributes such as objects and backgrounds, which distorts bias scores. This work is the first to systematically expose the confounding effect of such spurious features on bias assessment. We propose a feature-sensitivity-based perturbation analysis framework that applies targeted interventions (object masking and background blurring) to test the robustness of bias scores across four major benchmarks, for both generative and contrastive VLMs (e.g., CLIP variants). Experiments reveal that masking only 10% of salient objects or applying a mild background blur shifts bias scores by up to 175% for generative models and 43% for CLIP-style models. These findings challenge the validity of current evaluation protocols and establish feature disentanglement and sensitivity measurement as essential criteria for trustworthy bias assessment in VLMs.
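For concreteness, here is a minimal sketch of the two interventions, assuming each benchmark image ships with a binary segmentation mask over its salient objects (as COCO-style annotations provide). The function names, the gray-fill value, and the reading of "10%" as a fraction of object pixels (rather than object instances) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from PIL import Image, ImageFilter

def mask_objects(image: Image.Image, obj_mask: np.ndarray,
                 fraction: float = 0.10, seed: int = 0) -> Image.Image:
    """Gray out a fraction of the object pixels (here, 10% by default)."""
    rng = np.random.default_rng(seed)
    arr = np.array(image).copy()
    ys, xs = np.nonzero(obj_mask)                  # pixels inside salient objects
    pick = rng.choice(len(ys), size=int(fraction * len(ys)), replace=False)
    arr[ys[pick], xs[pick]] = 128                  # neutral gray fill (an assumption)
    return Image.fromarray(arr)

def blur_background(image: Image.Image, obj_mask: np.ndarray,
                    radius: float = 2.0) -> Image.Image:
    """Weakly Gaussian-blur everything outside the object mask."""
    blurred = np.array(image.filter(ImageFilter.GaussianBlur(radius)))
    arr = np.array(image).copy()
    background = obj_mask == 0
    arr[background] = blurred[background]          # keep objects sharp, blur the rest
    return Image.fromarray(arr)
```

The perturbed benchmark is then re-scored with the unchanged bias metric; if the score moves substantially, the metric is responding to the perturbed non-gender features rather than to gender.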

📝 Abstract
Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
Problem

Research questions and friction points this paper is trying to address.

Evaluating spurious feature impact on gender bias benchmarks
Assessing reliability of gender bias evaluation in VLMs
Quantifying how non-gender features distort bias metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic perturbation of non-gender benchmark features
Quantifying spurious feature impact on bias evaluation
Recommending feature-sensitivity measurements alongside bias metrics (see the sketch below)
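A minimal sketch of what such paired reporting could look like. Here `bias_metric` and the perturbation names are placeholders for whatever metric and interventions a given evaluation uses, and the relative-shift definition of sensitivity is an assumption, not the paper's formula.

```python
def bias_with_sensitivity(bias_metric, original, perturbed_variants):
    """Return the bias score plus its worst-case relative shift under perturbation.

    `perturbed_variants` maps a perturbation name (e.g., "mask_10pct",
    "blur_bg") to the perturbed copy of the benchmark.
    """
    base = bias_metric(original)
    shifts = {name: abs(bias_metric(data) - base) / max(abs(base), 1e-8)
              for name, data in perturbed_variants.items()}
    return base, max(shifts.values())  # e.g., report "bias 0.12 (sensitivity 1.75)"
```

Reporting the pair makes it visible when a seemingly precise bias score is in fact dominated by the model's response to spurious features.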