Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial

📅 2026-01-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge that large language models (LLMs) often generate stereotypical or inaccurate responses when interacting with speakers of English dialects other than Standard American English, struggling to faithfully model multi-dialectal dialogue. To tackle this, the authors propose MDial, a framework developed in collaboration with native-speaking linguists that employs rule-based LLM transformations to generate dialogue data spanning nine English dialects, capturing lexical, orthographic, and morphosyntactic (grammatical) features. The work makes the novel observation that LLMs should not fully replicate users' grammatical patterns, finding that up to 90% of a dialect's grammatical features are unsuitable for imitation. The authors further introduce MDialBench, the first large-scale parallel multi-dialectal dialogue benchmark, comprising over 50,000 dialogues and 97,000 question-answer pairs, with annotators preferring its outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Evaluations of 17 mainstream LLMs reveal dialect-identification accuracy generally below 70%, with Canadian English recognition falling under 50% and non-SAE dialects systematically misclassified as American or British.

πŸ“ Abstract
More than 80% of the 1.6 billion English speakers do not use Standard American English (SAE) and experience higher failure rates and stereotyped responses when interacting with LLMs as a result. Yet multi-dialectal performance remains underexplored. We introduce $\textbf{MDial}$, the first large-scale framework for generating multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features -- for nine English dialects. Partnering with native linguists, we design an annotated and scalable rule-based LLM transformation to ensure precision. Our approach challenges the assumption that models should mirror users' morphosyntactic features, showing that up to 90% of the grammatical features of a dialect should not be reproduced by models. Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Using this pipeline, we construct the dialect-parallel $\textbf{MDialBench}$mark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for Canadian English, and systematically misclassify non-SAE dialects as American or British. As dialect identification underpins natural language understanding, these errors risk cascading failures into downstream tasks.
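To give a concrete sense of the lexical and orthographic "pillars" the abstract mentions, here is a minimal sketch of a rule-based SAE-to-British transformation. The rule tables and function names are hypothetical illustrations; MDial's actual rules are curated with native linguists and also cover morphosyntax, which (per the paper's own finding) should largely not be imitated by models.

```python
import re

# Hypothetical rule tables: two of the three feature types described in
# the abstract (lexical and orthographic). Not the paper's actual rules.
ORTHOGRAPHIC = {"color": "colour", "center": "centre", "organize": "organise"}
LEXICAL = {"truck": "lorry", "apartment": "flat", "elevator": "lift"}

def apply_rules(text: str, rules: dict) -> str:
    """Apply whole-word substitutions, preserving sentence-initial capitalization."""
    for src, tgt in rules.items():
        pattern = re.compile(r"\b" + re.escape(src) + r"\b", re.IGNORECASE)
        def repl(m, tgt=tgt):
            return tgt.capitalize() if m.group(0)[0].isupper() else tgt
        text = pattern.sub(repl, text)
    return text

def to_british(sae_text: str) -> str:
    # Lexical substitutions first, then orthographic ones.
    return apply_rules(apply_rules(sae_text, LEXICAL), ORTHOGRAPHIC)

print(to_british("The truck stopped at the center of town."))
# The lorry stopped at the centre of town.
```

A real pipeline would be far larger and context-sensitive (e.g., "truck" as a verb should not become "lorry"), which is one reason the authors pair rule annotation with LLM transformation rather than relying on string substitution alone.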
Problem

Research questions and friction points this paper is trying to address.

dialect
large language models
non-standard English
dialogue generation
language bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-dialectal dialogue generation
dialect-aware LLMs
morphosyntactic transformation
MDialBench
non-Standard American English