🤖 AI Summary
Vision-language models trained on large-scale datasets often exhibit demographic biases, yet the causal mechanisms linking data bias to model bias remain unclear, primarily because web-scale image-text datasets (e.g., LAION-400M) lack fine-grained demographic annotations.
Method: We introduce the first large-scale annotated dataset covering 276 million human instances, each with perceived gender, race/ethnicity, and caption labels. Our automated annotation pipeline integrates object detection, multimodal caption generation, and fine-tuned classifiers, and is validated against human annotators for reliability.
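The three-stage pipeline described above can be sketched as a simple composition of components. The detector, captioner, and classifier below are placeholder stubs standing in for the paper's actual models, and all names here are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PersonAnnotation:
    box: Tuple[int, int, int, int]  # (x, y, w, h) bounding box
    caption: str                    # generated description of the person crop
    gender: str                     # perceived-gender label
    race: str                       # perceived-race/ethnicity label

def annotate_image(image,
                   detect_people: Callable,   # image -> list of boxes
                   caption_crop: Callable,    # (image, box) -> str
                   classify: Callable) -> List[PersonAnnotation]:
    """Run detection -> captioning -> attribute classification per detected person."""
    annotations = []
    for box in detect_people(image):
        caption = caption_crop(image, box)
        gender, race = classify(image, box)
        annotations.append(PersonAnnotation(box, caption, gender, race))
    return annotations

# Hypothetical stub components in place of real models:
fake_detect = lambda img: [(10, 10, 50, 80)]
fake_caption = lambda img, box: "a person standing outdoors"
fake_classify = lambda img, box: ("female", "white")

anns = annotate_image(None, fake_detect, fake_caption, fake_classify)
```

In practice each stub would be replaced by a trained model; keeping the stages behind plain callables makes it easy to validate each component against human annotations independently.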
Contribution/Results: Empirical analysis reveals that co-occurrence patterns between demographic attributes and textual contexts linearly explain 60–70% of gender bias in downstream models. Notably, individuals perceived as Black or Middle Eastern are significantly over-associated with crime-related and negative contexts. This bias propagates directly to downstream models such as CLIP and Stable Diffusion, causing skewed outputs. Our work establishes the first end-to-end empirical chain from dataset composition to model-level bias.
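The "linearly explain 60–70%" claim corresponds to a simple regression: for each context (e.g., a profession), regress a model's measured gender skew on the male/female co-occurrence rate in the training captions, then read off R² as the fraction of variance linearly explained. A minimal sketch, using illustrative numbers rather than the paper's measurements:

```python
def linear_r2(x, y):
    """Fit y ~ a*x + b by ordinary least squares; return (slope, intercept, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Illustrative (made-up) data: per-context male co-occurrence rate in
# captions (x) vs. a model's male-output rate for the same context (y).
cooccurrence = [0.9, 0.7, 0.5, 0.3, 0.1]
model_bias = [0.85, 0.72, 0.48, 0.35, 0.12]

slope, intercept, r2 = linear_r2(cooccurrence, model_bias)
```

A high R² under this fit is what licenses the statement that dataset co-occurrences linearly account for most of the model's gender bias.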
📝 Abstract
Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and fine-tuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that 60–70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias.