GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

📅 2026-05-28

🤖 AI Summary

This study addresses the lack of systematic evaluation of how large language models handle pronoun fidelity, referential reasoning, and gender bias in German, a language with a far richer grammatical gender system than English. The authors introduce GRUFF, a large-scale synthetically constructed dataset covering four noun gender agreement systems and four pronoun sets, including the neopronouns *xier* and *en*. The dataset tests whether encoder-only and decoder-only models correctly reuse a previously specified pronoun for an entity when distracting entities are mentioned in between. Without explicit context, models show strong grammatical agreement for masculine and feminine entities, but not for the neopronouns. Models are generally not robust to distractors, although encoder-only models are more robust in German than in English. Occupational stereotypes are poorly correlated across grammatical cases and across most models, except for models with closely related architectures.

📝 Abstract

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously-specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.
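The pronoun fidelity task described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from GRUFF: the sentence templates, the names `build_instance`, `fidelity_accuracy`, and `last_mentioned`, and the restriction to nominative pronoun forms are all hypothetical. The sketch builds a short German context that specifies a pronoun for one entity, optionally inserts distractor sentences about a second entity, and checks whether a predictor fills the blank with the originally specified pronoun.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Nominative forms only; the four pronoun sets named in the abstract.
PRONOUNS = ["er", "sie", "xier", "en"]
MASK = "___"


@dataclass
class Instance:
    text: str    # context with one masked pronoun slot
    target: str  # pronoun that was specified for the entity


def build_instance(entity: str, pronoun: str,
                   distractor: Optional[str] = None,
                   distractor_pronoun: Optional[str] = None,
                   n_distractors: int = 0) -> Instance:
    """Hypothetical templates: introduction, distractors, then a masked query."""
    parts = [f"{entity} kam ins Büro, weil {pronoun} einen Termin hatte."]
    for _ in range(n_distractors):
        parts.append(
            f"{distractor} wartete schon, weil {distractor_pronoun} "
            "früh angekommen war."
        )
    parts.append(f"{entity} setzte sich, weil {MASK} müde war.")
    return Instance(" ".join(parts), pronoun)


def fidelity_accuracy(instances: List[Instance],
                      predict: Callable[[str, List[str]], str]) -> float:
    """Share of instances where the predicted pronoun equals the specified one."""
    correct = sum(predict(inst.text, PRONOUNS) == inst.target
                  for inst in instances)
    return correct / len(instances)


def last_mentioned(text: str, candidates: List[str]) -> str:
    """Toy baseline (not a language model): return the most recent pronoun."""
    tokens = [tok.strip(".,").lower() for tok in text.split()]
    seen = [tok for tok in tokens if tok in candidates]
    return seen[-1] if seen else candidates[0]


clean = build_instance("Der Buchhalter", "er")
distracted = build_instance("Der Buchhalter", "er",
                            distractor="Die Ärztin",
                            distractor_pronoun="sie",
                            n_distractors=1)
print(fidelity_accuracy([clean], last_mentioned))
print(fidelity_accuracy([distracted], last_mentioned))
```

The recency baseline is included only to show why distractors matter: a predictor that copies the most recently mentioned pronoun succeeds on the clean context and fails once a distractor sentence intervenes. A real evaluation would replace `last_mentioned` with a model-backed predictor, for example one that scores each candidate pronoun in the masked slot.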

Problem

Research questions and friction points this paper is trying to address.

pronoun fidelity

grammatical gender

language models

stereotypical biases

referential reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

pronoun fidelity

grammatical gender

referential reasoning