🤖 AI Summary
This study investigates how large language models (LLMs) may reproduce and amplify racial stereotypes in automated text annotation, thereby compromising the fairness of downstream research and decision-making. Through two experiments comprising more than four million annotations, the authors systematically evaluate bias across 19 LLMs, with the names-based experiment spanning 39 annotation tasks. Using a controlled design based on ethnically associated names and matched dialects, combined with quantitative analysis, the work reveals systematic biases against Black, Asian, and Arab populations. For instance, the models judge African American Vernacular English as less professional and more angry, and attribute high intelligence but low confidence and sociability to Asian individuals. The one notable exception is name-based hireability, where the models appear to overcorrect in favor of minority-named applicants. The study quantifies complex stereotyping patterns such as the “bamboo ceiling” and dialect-based prejudice, offering a large-scale, multi-model basis for evaluating LLM fairness.
📝 Abstract
Large language models (LLMs) are increasingly used for automated text annotation in tasks ranging from academic research to content moderation and hiring. Across 19 LLMs and two experiments totaling more than 4 million annotation judgments, we show that subtle identity cues embedded in text systematically bias annotation outcomes in ways that mirror racial stereotypes. In a names-based experiment spanning 39 annotation tasks, texts containing names associated with Black individuals are rated as more aggressive by 18 of 19 models and more gossipy by 18 of 19. Asian names produce a bamboo-ceiling profile: 17 of 19 models rate individuals as more intelligent, while 18 of 19 rate them as less confident and less sociable. Arab names elicit cognitive elevation alongside interpersonal devaluation, and all four minority groups are consistently rated as less self-disciplined. In a matched dialect experiment, the same sentence is judged significantly less professional (all 19 models, mean gap $-0.774$), less indicative of an educated speaker ($-0.688$), more toxic (18/19), and more angry (19/19) when written in African American Vernacular English rather than Standard American English. A notable exception occurs for name-based hireability, where fine-tuning appears to overcorrect, systematically favoring minority-named applicants. These findings suggest that using LLMs as automated annotators can embed socially patterned biases directly into the datasets and measurements that increasingly underpin research, governance, and decision-making.
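To make the matched-dialect comparison concrete, the following is a minimal sketch of how a per-model mean gap and an "N of 19 models" style count could be computed from paired ratings. It is not the authors' code. The abstract does not state the rating scale or the exact estimator, so the sketch assumes numeric ratings, a simple mean of paired differences, and a sign convention of AAVE minus SAE (so a negative gap means the AAVE version was rated lower). The function names, model names, and toy numbers are all hypothetical.

```python
from statistics import mean


def mean_paired_gap(pairs):
    """Mean of (aave_score - sae_score) over matched sentence pairs.

    `pairs` is a list of (sae_score, aave_score) tuples, one tuple per
    sentence rated in both dialects by the same model. A negative result
    means the AAVE version received lower ratings on average.
    """
    if not pairs:
        raise ValueError("need at least one matched pair")
    return mean(aave - sae for sae, aave in pairs)


def count_models_with_negative_gap(ratings_by_model):
    """Count the models whose mean paired gap is below zero."""
    return sum(
        1 for pairs in ratings_by_model.values() if mean_paired_gap(pairs) < 0
    )


# Toy, made-up ratings for illustration only (not data from the study).
toy_ratings = {
    "model_a": [(4, 3), (5, 4), (3, 3)],  # AAVE rated lower on average
    "model_b": [(4, 4), (3, 4), (5, 5)],  # AAVE rated slightly higher
}

gaps = {name: mean_paired_gap(pairs) for name, pairs in toy_ratings.items()}
n_negative = count_models_with_negative_gap(toy_ratings)
print(gaps, n_negative)
```

Because each sentence is rated in both dialects, differencing within pairs holds sentence content fixed, so any remaining gap is attributable to the dialect cue under these assumptions. A full analysis would also need a significance test, which this sketch omits.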