🤖 AI Summary
This study addresses the limited robustness of existing harmful content detectors, which are predominantly developed for Standard American English and exhibit systemic biases against speakers of non-standard dialects across the globe. To tackle this issue, the authors propose DIA-HARM, a novel evaluation framework, and introduce D3, the first benchmark corpus spanning 50 English dialects with 195,000 samples, generated via linguistically grounded Multi-VALUE transformations. Through systematic evaluation of 16 models, the work reveals significant performance degradation in multidialectal settings: human-written dialectal content reduces F1 scores by 1.4–3.6%, with some models suffering drops of over 33% on mixed-content inputs. Among all models tested, the multilingual mDeBERTa achieves the strongest performance (average F1: 97.2%), substantially outperforming monolingual and zero-shot large language models, thereby advancing equitable and inclusive content moderation technologies.
📝 Abstract
Harmful content detectors, particularly disinformation classifiers, are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE's linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4–3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross-dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM-RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non-SAE speakers worldwide. We release the DIA-HARM framework, D3 corpus, and evaluation tools: https://github.com/jsl5710/dia-harm
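The core measurement in the abstract is F1 degradation: the same detector is scored on SAE inputs and on their dialect-transformed counterparts, and the relative F1 drop quantifies the robustness gap. A minimal sketch of that comparison is below; the labels and predictions are illustrative toy values, not drawn from the D3 corpus, and this is not the authors' released evaluation code.

```python
# Hypothetical sketch of the F1-degradation comparison described in the
# abstract: score one detector on SAE inputs vs. dialect-transformed
# inputs and report the relative drop. Toy data, no external dependencies.

def f1_score(y_true, y_pred, positive=1):
    """Binary F1 computed from true/false positive and false negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Gold labels (1 = disinformation) and one detector's predictions on the
# same items, rendered in SAE and in a dialect-transformed variant.
gold         = [1, 1, 1, 0, 0, 1, 0, 1]
pred_sae     = [1, 1, 1, 0, 0, 1, 0, 0]
pred_dialect = [1, 0, 1, 0, 1, 1, 0, 0]

f1_sae = f1_score(gold, pred_sae)
f1_dia = f1_score(gold, pred_dialect)
degradation = (f1_sae - f1_dia) / f1_sae * 100  # relative F1 drop, in percent
```

In the paper's setting this comparison is repeated per dialect (and across the 2,450 dialect pairs for transfer analysis); the sketch only shows the per-pair arithmetic.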