🤖 AI Summary
Existing benchmarks inadequately assess how large language models misgender individuals, including those with non-binary gender identities, and perpetuate stereotypes in long-form texts. This work proposes ProText, a multidimensional dataset spanning theme nouns, theme categories, and pronoun categories, designed to systematically evaluate gender representation bias in tasks such as text rewriting and summarization. Moving beyond conventional pronoun resolution frameworks, the approach enables, for the first time, analysis of misgendering in contexts that involve non-binary identities or lack explicit gender cues. The methodology combines diverse long-text constructions, multidimensional annotations, and carefully designed prompting experiments. A preliminary case study shows that models exhibit systematic gender bias when gender information is absent or when they default to heteronormative assumptions.
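To make the prompting experiments concrete, here is a minimal sketch of how one misgendering check on model outputs might look. Everything below is illustrative, not part of the ProText release: the pronoun inventories are simplified, and the names `pronoun_counts` and `flag_misgendering` are hypothetical.

```python
import re

# Simplified pronoun inventories per category (real sets would be larger,
# e.g. including neopronouns).
PRONOUNS = {
    "masculine": {"he", "him", "his", "himself"},
    "feminine": {"she", "her", "hers", "herself"},
    "gender-neutral": {"they", "them", "their", "theirs", "themself", "themselves"},
}

def pronoun_counts(text: str) -> dict:
    """Count pronouns of each category in a model output."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {cat: sum(tokens.count(p) for p in forms)
            for cat, forms in PRONOUNS.items()}

def flag_misgendering(source_pronoun_category: str, output: str) -> bool:
    """Flag outputs that introduce gendered pronouns the source never licensed,
    e.g. a 'none' or 'gender-neutral' input rewritten with 'he'/'she'."""
    counts = pronoun_counts(output)
    if source_pronoun_category in ("none", "gender-neutral"):
        return counts["masculine"] > 0 or counts["feminine"] > 0
    # For gendered inputs, a pronoun from the opposite binary category is a flag.
    opposite = "feminine" if source_pronoun_category == "masculine" else "masculine"
    return counts[opposite] > 0
```

A surface check like this catches the headline failure mode described above (gendering an input that carried no gender cues); attributing pronouns to the correct referent in long texts would additionally require coreference resolution.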
📝 Abstract
We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewriting with state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the gender binary. We validate ProText through a mini case study, showing that even with just two prompts and two models, the dataset yields nuanced insights into gender bias, stereotyping, misgendering, and gendering. We reveal systematic gender bias, particularly when inputs contain no explicit gender cues or when models default to heteronormative assumptions.
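As a rough illustration of the three annotation dimensions, a single ProText item might be represented as follows. This is a hypothetical schema sketched from the abstract; the field names and category labels are stand-ins, not the dataset's actual column names.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ProTextItem:
    """One long-form passage annotated along the three ProText dimensions."""
    text: str                 # stylistically diverse long-form English passage
    theme_noun: str           # a name, occupation, title, or kinship term
    theme_category: Literal[
        "stereotypically_male",
        "stereotypically_female",
        "neutral",            # gender-neutral / non-gendered
    ]
    pronoun_category: Literal["masculine", "feminine", "gender_neutral", "none"]

# Example: a stereotyped occupation with no pronoun cues, the case where
# models are most prone to defaulting to a gendered reading.
item = ProTextItem(
    text="The nurse finished a double shift before heading home...",
    theme_noun="nurse",
    theme_category="stereotypically_female",
    pronoun_category="none",
)
```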