🤖 AI Summary
This work addresses the lack of a unified, adaptable framework for evaluating gender bias in text-to-image generation models, a gap that makes it hard to align bias evaluation with the differing risk governance requirements of different application contexts. To bridge this gap, the authors propose a risk-aligned auditing framework and introduce THUMB cards, which bring together usage context, manifestations of bias, harm hypotheses, and auditing strategies to support context-aware bias assessment. Grounded in the European Union AI Act’s risk classification, the framework comprises risk-tiered use-case profiles, a catalog of bias metrics spanning gender prediction, embedding similarity, and downstream task performance, and a contextualized harm typology. Together, these components are intended to yield an interpretable and actionable audit pipeline that makes gender bias evaluation more practical and relevant for both technical auditing and AI governance.
📝 Abstract
Text-to-image (T2I) generative models are increasingly used to produce content for education, media, and public-facing communication, and are starting to be integrated into higher-impact pipelines. Because generated images tend to reinforce stereotypes, producing representational erasure via "default" depictions and shaping perceptions of who belongs in certain roles, a growing body of work has proposed metrics to quantify gender bias in T2I outputs. Yet existing evaluations remain fragmented. Metrics are often reported without a shared view of what they measure, what assumptions they entail, or how their results should be interpreted under different deployment contexts. This limits the usefulness of gender bias measurement for both technical auditing and emerging governance discussions. We propose a risk-aligned auditing framework for gender bias in T2I models, composed of three components that connect risk categories, evaluation metrics, and harms. First, we identify risk-tiered use-case profiles aligned with the EU AI Act's risk categories to motivate why auditing expectations may vary with deployment context and stakeholder exposure. Second, we construct a metric catalog that consolidates gender-bias evaluation methods and organizes them into three measurement categories: gender prediction, embedding similarity, and downstream task. Third, we introduce a harm typology that maps context-dependent harm categories (e.g., representational, quality-of-service) to specific risk-tiered scenarios. Finally, we introduce THUMB cards (Text-to-image Harms-informed Use-case-aligned Metrics of gender Bias), which help structure audits systematically by incorporating context, scenario and bias manifestation, harm hypotheses, and audit strategy.
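To make the card idea concrete, the following is a minimal, hypothetical sketch of how the elements named in the abstract (context, scenario and bias manifestation, harm hypotheses, audit strategy) might be recorded alongside one simple metric from the gender prediction category. The class `ThumbCard`, its field names, the function `parity_gap`, and the example values are all illustrative assumptions, not the authors' schema or metric definitions. The labels are assumed to come from some external gender classifier, and treating them as binary is a simplification.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ThumbCard:
    """Hypothetical record of one audit plan; field names are assumptions."""
    context: str                 # deployment context, e.g. an educational tool
    risk_tier: str               # assumed label for an EU AI Act style risk tier
    scenario: str                # prompt scenario being audited
    bias_manifestation: str      # how bias is expected to show up in outputs
    harm_hypotheses: list[str] = field(default_factory=list)
    audit_strategy: list[str] = field(default_factory=list)


def parity_gap(predicted_labels: list[str]) -> float:
    """Absolute gap between the shares of 'female' and 'male' predictions.

    Labels other than 'female' and 'male' are ignored in this sketch.
    Returns 0.0 for a balanced set and 1.0 when only one label appears.
    """
    counts = Counter(predicted_labels)
    considered = counts["female"] + counts["male"]
    if considered == 0:
        raise ValueError("no 'female' or 'male' labels to compare")
    return abs(counts["female"] - counts["male"]) / considered


# Illustrative card for an invented scenario (all values are made up).
card = ThumbCard(
    context="illustrations for a career-guidance website",
    risk_tier="limited",
    scenario="prompts of the form 'a photo of an engineer'",
    bias_manifestation="one perceived gender dominates generated images",
    harm_hypotheses=["representational harm through stereotype reinforcement"],
    audit_strategy=["gender prediction on generated images, then parity gap"],
)

# Invented classifier outputs for ten generated images.
labels = ["male"] * 8 + ["female"] * 2
gap = parity_gap(labels)
```

Keeping the card separate from the metric function reflects the abstract's point that the same measurement may need to be interpreted differently depending on the deployment context it is attached to.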