Rejected Dialects: Biases Against African American Language in Reward Models

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work reveals systematic bias against African American Language (AAL) in the reward models used to align large language models (LLMs): reward model preference judgments on AAL texts diverge from human evaluations (4% lower accuracy on average than on White Mainstream English, WME), consistently underrate AAL-aligned responses, and steer outputs toward WME. To probe this, the authors introduce a dialectal bias evaluation framework for reward modeling that pairs WME texts with both human-written and machine-translated AAL counterparts. Through paired comparative experiments, reward score deviation analysis, and dialogue turn-tracking, they localize the bias to the reward modeling stage itself. The findings highlight risks of representational harm and provide both methodological foundations and empirical evidence for extending LLM fairness research to linguistic diversity.

📝 Abstract
Preference alignment via reward models helps build safe, helpful, and reliable large language models (LLMs). However, subjectivity in preference judgments and the lack of representative sampling in preference data collection can introduce new biases, hindering reward models' fairness and equity. In this work, we introduce a framework for evaluating dialect biases in reward models and conduct a case study on biases against African American Language (AAL) through several experiments comparing reward model preferences and behavior on paired White Mainstream English (WME) and both machine-translated and human-written AAL corpora. We show that reward models are less aligned with human preferences when processing AAL texts vs. WME ones (-4% accuracy on average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and steer conversations toward WME, even when prompted with AAL texts. Our findings provide a targeted analysis of anti-AAL biases at a relatively understudied stage in LLM development, highlighting representational harms and ethical questions about the desired behavior of LLMs concerning AAL.
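The paired-comparison setup in the abstract (scoring WME and AAL versions of the same responses and checking agreement with human preference labels) can be sketched as below. This is a minimal illustration, not the authors' code: `reward_score` is a hypothetical stand-in for any reward model mapping (prompt, response) to a scalar, and the toy data is invented.

```python
# Sketch of a paired preference-accuracy evaluation for a reward model.
# `reward_score` is a hypothetical placeholder; a real setup would call an
# RLHF-trained reward model instead of this dummy length heuristic.

def reward_score(prompt: str, response: str) -> float:
    # Dummy scorer for illustration only: longer responses score higher.
    return float(len(response))

def paired_preference_accuracy(pairs, human_prefers_first):
    """pairs: list of (prompt, response_a, response_b) triples.
    human_prefers_first: list of bools, True if annotators preferred response_a.
    Returns the fraction of pairs where the reward model agrees with humans."""
    agree = 0
    for (prompt, a, b), prefers_a in zip(pairs, human_prefers_first):
        model_prefers_a = reward_score(prompt, a) > reward_score(prompt, b)
        agree += int(model_prefers_a == prefers_a)
    return agree / len(pairs)

# Run the same metric on WME pairs and on AAL pairs (machine-translated or
# human-written) and compare the two accuracies; the paper reports roughly
# 4% lower agreement with humans on AAL texts.
wme_pairs = [("prompt", "a longer, more helpful response", "short reply")]
aal_pairs = [("prompt", "a longer, more helpful response", "short reply")]
acc_wme = paired_preference_accuracy(wme_pairs, [True])
acc_aal = paired_preference_accuracy(aal_pairs, [True])
print(acc_wme - acc_aal)
```

With a real reward model, the gap `acc_wme - acc_aal` operationalizes the dialect-accuracy difference the paper measures; the same scored pairs also feed a reward-score deviation analysis (comparing raw score distributions across dialects).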
Problem

Research questions and friction points this paper addresses.

Detecting dialect biases introduced at the reward modeling stage
Focusing on African American Language (AAL) as a case study
Comparing reward model behavior on paired AAL and WME texts
Innovation

Methods, ideas, or system contributions that make the work stand out.

A framework for evaluating dialect biases in reward models
Paired WME and AAL corpora, both machine-translated and human-written
Targeted evidence of anti-AAL bias at an understudied stage of LLM development