VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

📅 2024-06-20
🏛️ arXiv.org
📈 Citations: 3 (Influential: 0)
🤖 AI Summary
This study systematically evaluates output bias in large vision-language models (LVLMs) across nine social attributes, including age, gender, race, and religion, as well as two intersectional dimensions (race × gender and race × socioeconomic status). To this end, the authors introduce VLBiasBench, a comprehensive bias benchmark designed for LVLMs, comprising 46,848 synthetically generated images and 128,342 question-answer samples in multiple formats. The methodology pairs Stable Diffusion XL-generated images with varied question templates and supports both open-ended and close-ended evaluation protocols, allowing bias to be probed from multiple perspectives. Empirical evaluation of 17 state-of-the-art LVLMs (15 open-source and two closed-source) yields new insights into the biases present in these models, including pronounced vulnerabilities to intersectional biases. VLBiasBench is publicly released.
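For illustration, here is a minimal sketch of how Stable Diffusion XL-based image synthesis for bias probes could look using the Hugging Face diffusers library. The checkpoint name is the standard public SDXL base model; the prompt template and attribute list are illustrative assumptions, not the authors' released generation pipeline:

```python
# Minimal sketch (assumed, not the authors' code) of SDXL-based image
# generation for bias probes, via the Hugging Face diffusers API.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Hypothetical subset of the nine social-attribute categories.
attributes = ["age", "gender", "race", "religion"]
prompt_template = "a photo of a person, {attribute} context"  # assumed template

for attribute in attributes:
    prompt = prompt_template.format(attribute=attribute)
    # pipe(...) returns a pipeline output whose .images list holds PIL images.
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"bias_probe_{attribute}.png")
```

In practice, one would sweep many prompt variants per attribute to reach the dataset's scale of tens of thousands of images.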

📝 Abstract
The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are accompanied by concerns about biased outputs, a challenge that has yet to be thoroughly explored. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single question format, and narrow sources of bias. To address this problem, we introduce VLBiasBench, a comprehensive benchmark designed to evaluate biases in LVLMs. VLBiasBench features a dataset that covers nine distinct categories of social bias, including age, disability status, gender, nationality, physical appearance, race, religion, profession, and socioeconomic status, as well as two intersectional bias categories: race × gender and race × socioeconomic status. To build a large-scale dataset, we use the Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with various questions to create 128,342 samples. These questions are divided into open-ended and close-ended types, ensuring thorough consideration of bias sources and a comprehensive evaluation of LVLM biases from multiple perspectives. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.
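As a reading aid, the close-ended protocol described above could be scored roughly along these lines. Everything here is a hypothetical placeholder: query_lvlm stands in for any model-specific inference call, and the sample fields and unbiased-answer check are assumed rather than taken from the benchmark's actual harness:

```python
# Illustrative sketch (assumed) of scoring close-ended bias samples:
# each sample pairs a generated image with a multiple-choice question,
# and the model's answer is checked against an unbiased reference.
from dataclasses import dataclass

@dataclass
class CloseEndedSample:
    image_path: str
    question: str
    choices: list[str]
    unbiased_answer: str  # e.g. "cannot be determined"

def evaluate(samples: list[CloseEndedSample], query_lvlm) -> float:
    """Return the fraction of close-ended samples answered without bias."""
    unbiased = 0
    for s in samples:
        prompt = f"{s.question} Options: {', '.join(s.choices)}"
        answer = query_lvlm(s.image_path, prompt)  # hypothetical LVLM call
        if s.unbiased_answer.lower() in answer.lower():
            unbiased += 1
    return unbiased / len(samples)
```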
Problem

Research questions and friction points this paper is trying to address.

Bias Evaluation
Visual Language Models
Socio-demographic Attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLBiasBench
Bias Detection
Visual Language Models