Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment

📅 2025-09-23

🤖 AI Summary

This study finds that large vision-language models (VLMs) systematically reproduce societal biases, particularly around gender and occupation, when interpreting news images. Method: The authors build a benchmark of 1,343 real-world news image-question pairs annotated with ground-truth answers and social attributes (age, gender, race, occupation, and sports), and evaluate a range of state-of-the-art VLMs using an LLM-as-judge framework with human verification. Contribution/Results: Visual context systematically shifts model outputs in open-ended settings, bias prevalence varies across attributes and models, and higher faithfulness does not necessarily correspond to lower bias. The benchmark prompts, evaluation rubric, and code are released to support reproducible, fairness-aware multimodal assessment.

📝 Abstract

Large vision-language models (VLMs) can jointly interpret images and text, but they are also prone to absorbing and reproducing harmful social stereotypes when visual cues such as age, gender, race, clothing, or occupation are present. To investigate these risks, we introduce a news-image benchmark consisting of 1,343 image-question pairs drawn from diverse outlets, which we annotated with ground-truth answers and demographic attributes (age, gender, race, occupation, and sports). We evaluate a range of state-of-the-art VLMs and employ a large language model (LLM) as judge, with human verification. Our findings show that: (i) visual context systematically shifts model outputs in open-ended settings; (ii) bias prevalence varies across attributes and models, with particularly high risk for gender and occupation; and (iii) higher faithfulness does not necessarily correspond to lower bias. We release the benchmark prompts, evaluation rubric, and code to support reproducible and fairness-aware multimodal assessment.
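
To make findings (ii) and (iii) concrete, the sketch below shows one plausible way to aggregate per-response judge verdicts into a bias prevalence rate per model and attribute, alongside a faithfulness rate per model. The record layout, the model names, and the helper functions `prevalence_by_group` and `faithfulness_by_model` are assumptions made for illustration; they are not taken from the released code, and the verdicts are toy data rather than benchmark results.

```python
from collections import defaultdict

# Hypothetical judge verdicts: (model, attribute, judged_biased, judged_faithful).
# All names and values here are illustrative toy data, not benchmark results.
verdicts = [
    ("model_a", "gender", True, True),
    ("model_a", "gender", False, True),
    ("model_a", "occupation", True, True),
    ("model_b", "gender", False, False),
    ("model_b", "gender", False, True),
    ("model_b", "occupation", True, False),
]


def prevalence_by_group(rows):
    """Return {(model, attribute): fraction of responses judged biased}."""
    counts = defaultdict(lambda: [0, 0])  # [biased count, total count]
    for model, attribute, biased, _faithful in rows:
        counts[(model, attribute)][0] += int(biased)
        counts[(model, attribute)][1] += 1
    return {key: biased / total for key, (biased, total) in counts.items()}


def faithfulness_by_model(rows):
    """Return {model: fraction of responses judged faithful}."""
    counts = defaultdict(lambda: [0, 0])  # [faithful count, total count]
    for model, _attribute, _biased, faithful in rows:
        counts[model][0] += int(faithful)
        counts[model][1] += 1
    return {model: faithful / total for model, (faithful, total) in counts.items()}


bias_rates = prevalence_by_group(verdicts)
faith_rates = faithfulness_by_model(verdicts)

for (model, attribute), rate in sorted(bias_rates.items()):
    print(f"{model:8s} {attribute:11s} bias rate = {rate:.2f}")
for model, rate in sorted(faith_rates.items()):
    print(f"{model:8s} faithfulness = {rate:.2f}")
```

In this toy data, `model_a` is judged faithful on every response yet has a higher gender bias rate than `model_b`, which is the kind of pattern finding (iii) describes. Keeping the two rates as separate aggregates avoids folding bias into a single quality score.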

Problem

Research questions and friction points this paper is trying to address.

Evaluating bias risks in vision-language models from social-cue news images

Assessing how visual context systematically shifts model outputs in open-ended settings

Investigating bias prevalence across demographic attributes like gender and occupation

Innovation

Methods, ideas, or system contributions that make the work stand out.

News-image benchmark with demographic annotations

LLM-as-judge assessment with human verification

Evaluation of bias prevalence across attributes and models
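
Human verification of an LLM judge is often summarized with an agreement statistic. The sketch below computes Cohen's kappa between judge and human labels on a shared sample. This page does not say which statistic the authors used, so the choice of kappa, the label names, and the `cohen_kappa` helper are all assumptions, and the labels are toy data.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (assumed): LLM judge vs. human annotator on six responses.
judge = ["biased", "ok", "ok", "biased", "ok", "ok"]
human = ["biased", "ok", "biased", "biased", "ok", "ok"]

kappa = cohen_kappa(judge, human)
print(f"judge-human kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example "ok") dominates the sample.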
