🤖 AI Summary
Vision-language models trained on large-scale datasets often exhibit demographic biases, yet the causal mechanisms linking data bias to model bias remain unclear, primarily because web-scale image-text datasets (e.g., LAION-400M) lack fine-grained demographic annotations.
Method: We introduce the first large-scale annotated dataset covering 276 million human instances, each with perceived gender, race/ethnicity, and caption labels. Our automated annotation pipeline integrates object detection, multimodal caption generation, and fine-tuned classifiers, and is validated against human annotators for reliability.
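The three-stage pipeline described above can be sketched as a simple composition of components. The detector, captioner, and classifier below are placeholder stubs standing in for the paper's actual models, and all names here are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PersonAnnotation:
    box: Tuple[int, int, int, int]  # (x, y, w, h) bounding box
    caption: str                    # generated description of the person crop
    gender: str                     # perceived-gender label
    race: str                       # perceived-race/ethnicity label

def annotate_image(image,
                   detect_people: Callable,   # image -> list of boxes
                   caption_crop: Callable,    # (image, box) -> str
                   classify: Callable) -> List[PersonAnnotation]:
    """Run detection -> captioning -> attribute classification per detected person."""
    annotations = []
    for box in detect_people(image):
        caption = caption_crop(image, box)
        gender, race = classify(image, box)
        annotations.append(PersonAnnotation(box, caption, gender, race))
    return annotations

# Hypothetical stub components in place of real models:
fake_detect = lambda img: [(10, 10, 50, 80)]
fake_caption = lambda img, box: "a person standing outdoors"
fake_classify = lambda img, box: ("female", "white")

anns = annotate_image(None, fake_detect, fake_caption, fake_classify)
```

In practice each stub would be replaced by a trained model; keeping the stages behind plain callables makes it easy to validate each component against human annotations independently.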
Contribution/Results: Empirical analysis reveals that co-occurrence patterns between demographic attributes and textual contexts linearly explain 60–70% of gender bias in downstream models. Notably, individuals perceived as Black or Middle Eastern are significantly over-associated with crime-related and negative contexts. This bias propagates directly to downstream models such as CLIP and Stable Diffusion, causing skewed outputs. Our work establishes the first end-to-end empirical chain from dataset composition to model-level bias.
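The "linearly explain 60–70%" claim corresponds to a simple regression: for each context (e.g., a profession), regress a model's measured gender skew on the male/female co-occurrence rate in the training captions, then read off R² as the fraction of variance linearly explained. A minimal sketch, using illustrative numbers rather than the paper's measurements:

```python
def linear_r2(x, y):
    """Fit y ~ a*x + b by ordinary least squares; return (slope, intercept, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Illustrative (made-up) data: per-context male co-occurrence rate in
# captions (x) vs. a model's male-output rate for the same context (y).
cooccurrence = [0.9, 0.7, 0.5, 0.3, 0.1]
model_bias = [0.85, 0.72, 0.48, 0.35, 0.12]

slope, intercept, r2 = linear_r2(cooccurrence, model_bias)
```

A high R² under this fit is what licenses the statement that dataset co-occurrences linearly account for most of the model's gender bias.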
📝 Abstract
Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and fine-tuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that 60–70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias.