Culture in Action: Evaluating Text-to-Image Models through Social Activities

📅 2025-11-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Text-to-image (T2I) diffusion models trained on large-scale web data achieve high photorealism but inherit cultural biases and depict underrepresented regions, particularly Global South countries, less faithfully. Existing cultural benchmarks focus on object-centric categories (e.g., food, attire, architecture), overlook the social and daily activities that more clearly reflect cultural norms, and offer few metrics for cultural faithfulness. Method: We introduce CULTIVate, a benchmark for cross-cultural social activities (e.g., greetings, dining, games, traditional dances, celebrations) that spans 16 countries, 576 prompts, and more than 19,000 generated images. Its explainable, descriptor-based framework scores images along several cultural dimensions (background, attire, objects, interactions) with four metrics: cultural alignment, hallucination, exaggeration, and diversity. Contribution/Results: Models perform better for Global North countries than for Global South ones, with distinct failure modes across T2I systems, and human studies show that the proposed metrics correlate more strongly with human judgments than existing text-image metrics.

📝 Abstract

Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but they inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for Global North countries than for Global South countries, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics.
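
The claim that a metric "correlates with human judgments" is usually checked with a rank correlation between metric scores and human ratings over the same images. The sketch below assumes Spearman's rank correlation; the abstract does not name the statistic used. The function names and all score values are hypothetical toy data for illustration, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every value tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: human ratings for five images and two candidate metrics.
human_ratings = [1, 2, 3, 4, 5]
metric_a = [0.10, 0.40, 0.35, 0.80, 0.90]  # mostly agrees with the human ordering
metric_b = [0.50, 0.20, 0.60, 0.30, 0.40]  # largely unrelated to the human ordering

rho_a = spearman(human_ratings, metric_a)
rho_b = spearman(human_ratings, metric_b)
print(f"metric_a rho = {rho_a:.2f}, metric_b rho = {rho_b:.2f}")
```

On this toy data, `metric_a` tracks the human ordering more closely than `metric_b`, which is the kind of comparison the human study reports between the proposed metrics and existing text-image metrics.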

Problem

Research questions and friction points this paper is trying to address.

Evaluating cultural biases in text-to-image models across underrepresented regions

Developing metrics to measure cultural faithfulness in social activities

Exposing systematic performance disparities between Global North and Global South countries

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for cross-cultural activity evaluation

Descriptor-based framework across cultural dimensions

Four metrics measuring cultural alignment, hallucination, exaggeration, and diversity

🔎 Similar Papers

How Culturally Aware are Vision-Language Models?