Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures

📅 2026-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses systematic dialect-based biases in large language models (LLMs) when processing Standard American English (SAE) versus African American English (AAE), particularly in the assignment of occupations, names, and adjectives. The authors construct eight prompt templates to systematically compare, for the first time, single-agent approaches—including role prompting and chain-of-thought reasoning—with multi-agent architectures featuring generate-critique-revise workflows in mitigating such biases. Using an LLM-as-judge evaluation paradigm across multiple mainstream models, the experiments reveal significant SAE–AAE disparities in all tested models, with Claude Haiku exhibiting the largest bias and Phi-4 Mini the smallest. Chain-of-thought reasoning effectively reduces bias in Haiku, while the multi-agent framework consistently mitigates bias across all models, underscoring the importance of workflow-level interventions for ensuring fairness.

Technology Category

Application Category

📝 Abstract
Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype-based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African-American English (AAE). In this paper, we replicate existing analyses of dialect-sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures composed of generate-critique-revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM-as-judge approach to evaluate the bias in the results. Our results show that stereotype-bearing differences emerge between SAE- and AAE-related outputs across all template categories, with the strongest effects observed in adjective and job attribution. Baseline disparities vary substantially by model, with the largest SAE-AAE differential observed in Claude Haiku and the smallest in Phi-4 Mini. Chain-Of-Thought prompting proved to be an effective mitigation strategy for Claude Haiku, whereas the use of a multi-agent architecture ensured consistent mitigation across all the models. These findings suggest that for intersectionality-informed software engineering, fairness evaluation should include model-specific validation of mitigation strategies, and workflow-level controls (e.g., agentic architectures involving critique models) in high-impact LLM deployments. The current results are exploratory in nature and limited in scope, but can lead to extensions and replications by increasing the dataset size and applying the procedure to different languages or dialects.
Problem

Research questions and friction points this paper is trying to address.

linguistic stereotypes
dialect bias
large language models
discriminatory behavior
fairness in AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

dialect bias
multi-agent architecture
Chain-of-Thought prompting
LLM-as-judge
stereotype mitigation
🔎 Similar Papers
No similar papers found.
M
Martina Ullasci
Politecnico di Torino
Marco Rondina
Marco Rondina
PhD Student, Nexa Center for Internet & Society, DAUIN, Politecnico di Torino
artificial intelligencemachine learningalgorithmic fairness
Riccardo Coppola
Riccardo Coppola
Assistant Professor with Time Contract, Politecnico di Torino
GamificationEthical AISoftware Testing
Flavio Giobergia
Flavio Giobergia
Politecnico di Torino
R
Riccardo Bellanca
Politecnico di Torino
G
Gabriele Mancari Pasi
Politecnico di Torino
L
Luca Prato
Politecnico di Torino
F
Federico Spinoso
Politecnico di Torino
S
Silvia Tagliente
Politecnico di Torino