🤖 AI Summary
This study investigates the capacity of mainstream large language models (LLMs), including Llama, GPT, and Claude, to generate African American Vernacular English (AAVE) and examines how such generation affects user trust and perceived role appropriateness in healthcare and education contexts. Method: Using prompt engineering to systematically vary AAVE intensity, we conduct multi-dimensional subjective evaluations (credibility, professionalism, naturalness, and related attributes) via expert annotation and crowdsourced user studies. Contribution/Results: Contrary to prevailing assumptions, AAVE-speaking users consistently prefer responses in Standard American English (SAE); improving AAVE generation quality does not enhance user experience, and higher AAVE intensity correlates with significantly lower subjective ratings across all dimensions. These findings challenge the assumption that dialectal adaptation is necessary for inclusive AI, suggesting that surface-level linguistic alignment alone is insufficient, and potentially detrimental, when not grounded in authentic user preferences.
📝 Abstract
As chatbots become increasingly integrated into everyday tasks, designing systems that accommodate diverse user populations is crucial for fostering trust, engagement, and inclusivity. This study investigates the ability of contemporary Large Language Models (LLMs) to generate African American Vernacular English (AAVE) and evaluates the impact of AAVE usage on user experiences in chatbot applications. We analyze the performance of three LLM families (Llama, GPT, and Claude) in producing AAVE-like utterances at varying dialect intensities and assess user preferences across multiple domains, including healthcare and education. Despite LLMs' proficiency in generating AAVE-like language, findings indicate that AAVE-speaking users prefer Standard American English (SAE) chatbots, with higher levels of AAVE correlating with lower ratings for a variety of characteristics, including chatbot trustworthiness and role appropriateness. These results highlight the complexities of creating inclusive AI systems and underscore the need for further exploration of linguistic diversity to enhance human-computer interactions.