🤖 AI Summary
This work systematically evaluates how well text-to-image (T2I) models represent diverse sociocultural contexts. Addressing two critical gaps, namely models' neglect of culturally grounded expectations and the poor alignment of existing automatic metrics with human judgment, the authors introduce CulturalFrames, a cross-cultural benchmark for human evaluation of cultural representation. It spans 10 countries and 5 sociocultural domains, comprising 983 prompts, 3,637 generated images from 4 state-of-the-art T2I models (e.g., SDXL, DALL·E 3), and over 10,000 detailed human annotations. Across models and countries, cultural expectations are missed in 44% of cases on average; among these failures, explicit expectations are missed at an average rate of 68% and implicit expectations at 49%. Moreover, standard automatic T2I evaluation metrics correlate poorly with human judgments of cultural alignment.
📝 Abstract
The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit and implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3,637 corresponding images generated by 4 state-of-the-art T2I models, and over 10,000 detailed human annotations. We find that T2I models fail to meet not only the more challenging implicit expectations but also the less challenging explicit ones. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps and provide actionable directions for developing more culturally informed T2I models and evaluation methodologies.