🤖 AI Summary
This study identifies systematic performance degradation and service inequity in large language models (LLMs) on reasoning tasks formulated in African American Vernacular English (AAVE). To address this gap, we introduce ReDial—the first bilingual parallel reasoning benchmark comprising over 1,200 AAVE–Standard English query pairs—curated by native AAVE speakers, including computer science experts, who manually rewrote seven major reasoning benchmarks (e.g., GSM8K, HumanEval) across algorithmic, mathematical, and logical domains. We propose a dialect-aware reasoning evaluation framework and conduct cross-model experiments spanning GPT, Claude, Llama, Mistral, and Phi. Results demonstrate consistent accuracy drops of 15–40% across nearly all models when processing AAVE inputs, confirming substantial dialectal bias. We publicly release the ReDial dataset and evaluation code, establishing critical infrastructure for advancing linguistic fairness research in foundation models.
📝 Abstract
Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook within-language variation and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects in canonical reasoning tasks, including algorithmic, mathematical, logical, and integrated reasoning. We introduce **ReDial** (**Re**asoning with **Dial**ect queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including the GPT, Claude, Llama, Mistral, and Phi model families. Our findings reveal that **almost all of these widely used models show significant brittleness and unfairness to queries in AAVE**. Our work establishes a systematic and objective framework for analyzing LLM bias on dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for future research. Code and data can be accessed at https://github.com/fangru-lin/redial_dialect_robustness_fairness.
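The core of the evaluation described above is simple: each benchmark item exists in two parallel renderings (Standardized English and AAVE), so a model's dialect gap is the difference in accuracy between the two halves of the paired set. The sketch below illustrates that comparison; the field names (`se_correct`, `aave_correct`) are illustrative assumptions, not ReDial's actual data schema.

```python
# Minimal sketch of a parallel-pair dialect-gap computation.
# Each entry represents one ReDial-style query pair, with boolean
# correctness flags for the SE and AAVE renderings of the same query.

def dialect_gap(results):
    """Return (SE accuracy, AAVE accuracy, gap) over parallel pairs."""
    n = len(results)
    se_acc = sum(r["se_correct"] for r in results) / n
    aave_acc = sum(r["aave_correct"] for r in results) / n
    return se_acc, aave_acc, se_acc - aave_acc

# Toy example: a model solves 3/4 SE queries but only 1/4 AAVE counterparts.
pairs = [
    {"se_correct": True,  "aave_correct": True},
    {"se_correct": True,  "aave_correct": False},
    {"se_correct": True,  "aave_correct": False},
    {"se_correct": False, "aave_correct": False},
]
se, aave, gap = dialect_gap(pairs)  # 0.75, 0.25, 0.5
```

Because the pairs are semantically equivalent by construction, any accuracy gap isolates the effect of dialect rather than task difficulty, which is what makes the parallel design an objective fairness measure.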