🤖 AI Summary
This study systematically evaluates social bias, particularly gender and racial stereotypes, in the occupational depictions produced by text-to-image (TTI) models. To this end, we construct a benchmark dataset covering five occupational categories and propose a fairness-aware prompt engineering framework. We conduct the first systematic comparison of five leading TTI models (DALL·E 3, Gemini Imagen 4.0, FLUX.1-dev, Stable Diffusion XL Turbo, and Grok-2 Image) on their responsiveness to diversity-promoting prompts. Human annotation analysis reveals that prompts significantly shift demographic representation, but efficacy is highly model-dependent: some models diversify effectively, while others overcorrect or barely respond. Our findings demonstrate both the potential and the inherent limitations of prompt engineering as a lightweight bias mitigation strategy, underscoring the need to co-design prompt interventions with architectural improvements to achieve robust fairness in generative vision-language systems.
📝 Abstract
Text-to-Image (TTI) models are powerful creative tools but risk amplifying harmful social biases. We frame representational societal bias assessment as an image curation and evaluation task and introduce a pilot benchmark of occupational portrayals spanning five socially salient roles (CEO, Nurse, Software Engineer, Teacher, Athlete). Using five state-of-the-art models, both closed-source (DALL·E 3, Gemini Imagen 4.0) and open-source (FLUX.1-dev, Stable Diffusion XL Turbo, Grok-2 Image), we compare neutral baseline prompts against fairness-aware controlled prompts designed to encourage demographic diversity. All outputs are annotated for gender (male, female) and race (Asian, Black, White), enabling structured distributional analysis. Results show that prompting can substantially shift demographic representations, but with highly model-specific effects: some systems diversify effectively, others overcorrect into unrealistic uniformity, and some show little responsiveness. These findings highlight both the promise and the limitations of prompting as a fairness intervention, underscoring the need for complementary model-level strategies. We release all code and data for transparency and reproducibility: https://github.com/maximus-powers/img-gen-bias-analysis.
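The abstract does not specify the exact analysis pipeline, but the described "structured distributional analysis" can be illustrated with a minimal sketch: compute per-condition demographic shares from the human annotations and measure the prompt-induced shift, here with total variation distance (the metric choice and the example annotation counts are illustrative assumptions, not taken from the paper).

```python
from collections import Counter

def demographic_shares(annotations):
    """Map each demographic label to its fraction of the annotated images."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two share distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

# Hypothetical gender annotations for one occupation under two prompt conditions.
baseline   = ["male"] * 9 + ["female"] * 1   # neutral prompt
controlled = ["male"] * 5 + ["female"] * 5   # fairness-aware prompt

shift = total_variation(demographic_shares(baseline),
                        demographic_shares(controlled))
print(round(shift, 2))  # 0.4
```

Repeating this per model and per occupation gives a comparable scalar for how strongly each system responds to the diversity-promoting prompts.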