CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the underrepresentation of Global South cultures in text-to-image (T2I) models, a consequence of heavily Amero- and Euro-centric web-scraped training data. The authors introduce CuRe, a scalable, culturally grounded benchmarking and scoring suite for cultural representativeness. The CuRe dataset comprises 300 cultural artifacts organized into 32 subcategories under six broad cultural axes (food, art, fashion, architecture, celebrations, and people), built from the crowdsourced Wikimedia knowledge graph. The key idea is to use the marginal utility of attribute specification, i.e., how a T2I system's outputs change as the text conditioning becomes more informative, as a proxy for human judgments of cultural representation. Evaluations span image encoders (SigLIP 2, AIMV2, DINOv2), vision-language models (OpenCLIP, SigLIP 2, Gemini 2.0 Flash), and state-of-the-art T2I systems (Stable Diffusion 1.5/XL/3.5 Large, FLUX.1 [dev], Ideogram 2.0, DALL-E 3). Experiments demonstrate strong agreement between CuRe scores and human judgments of perceptual similarity, image-text alignment, and cultural diversity (average Spearman ρ > 0.82). The benchmark dataset, evaluation framework, and code are publicly released.

📝 Abstract
Popular text-to-image (T2I) systems are trained on web-scraped data, which is heavily Amero- and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel and scalable benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to T2I systems as a proxy for human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy built from the crowdsourced Wikimedia knowledge graph, with 300 cultural artifacts across 32 cultural subcategories grouped into six broad cultural axes (food, art, fashion, architecture, celebrations, and people). Our dataset's categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing their response to increasing the informativeness of text conditioning, enabling fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2, Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0, and DALL-E 3. The code and dataset are open-sourced and available at https://aniketrege.github.io/cure/.
Problem

Research questions and friction points this paper is trying to address.

How can cultural bias in text-to-image systems be assessed at scale?
How underrepresented are Global South cultures in T2I outputs?
How do T2I systems respond as prompts specify cultural attributes more precisely?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses the marginal utility of attribute specification as a proxy score for cultural representativeness
Builds a hierarchical benchmark dataset from the crowdsourced Wikimedia knowledge graph
Validates its scorers against human judgments across image encoders, vision-language models, and T2I systems
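The marginal-utility idea behind the scorers above can be illustrated with a minimal sketch. This is not the paper's implementation; the function names and similarity values are hypothetical, and the assumption is simply that each artifact is scored (e.g., by an image encoder or VLM) at several prompt levels of increasing informativeness.

```python
# Illustrative sketch of a marginal-utility-style score (hypothetical,
# not CuRe's actual code). Assumes image-text similarity scores for one
# cultural artifact at increasingly informative prompt levels.

def marginal_utilities(scores):
    """Successive gains in similarity as text conditioning grows more
    informative (sparse prompt -> fully attribute-specified prompt)."""
    return [b - a for a, b in zip(scores, scores[1:])]

def marginal_utility_score(scores):
    """Average gain per added attribute. A small value suggests the model
    already depicts the artifact well from a sparse prompt; a large value
    suggests it depends on explicit attributes, hinting at weaker
    cultural grounding."""
    gains = marginal_utilities(scores)
    return sum(gains) / len(gains)

# Hypothetical similarities at three prompt levels, e.g.
# "a dish" -> "a West African dish" -> a named dish with attributes.
sims = [0.41, 0.55, 0.78]
print(marginal_utility_score(sims))
```

Under this framing, scorers for different T2I systems can be compared artifact by artifact, and their rankings checked against human judgments (e.g., via Spearman correlation).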