ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

📅 2026-03-10
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This study addresses the performance degradation of NLP systems trained on standard Vietnamese when they are applied to dialectal input, particularly from the underrepresented Central and Southern regions. To mitigate this, the authors construct ViDia2Std, the first manually annotated dialect-to-standard parallel corpus covering all 63 provinces of Vietnam and the three major dialect groups (North, Central, and South), explicitly including Southern varieties and non-standard Northern variants; the authors describe it as the most dialectally inclusive Vietnamese corpus to date. The corpus contains over 13,000 sentence pairs built from real Facebook comments and annotated by native speakers. Annotation quality is assessed with a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Among the sequence-to-sequence models benchmarked for dialect normalization, mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base is competitive with fewer parameters. The authors also report that dialect normalization substantially improves downstream NLP task performance.
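For readers unfamiliar with how a ROUGE-L score on a 0 to 1 scale is obtained, the following is a minimal sketch of the common longest-common-subsequence formulation with a balanced F-measure. It is not the authors' evaluation script: the whitespace tokenization, the balanced (F1) weighting, and the function names `lcs_length` and `rouge_l_f1` are assumptions made for illustration.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Sentence-level ROUGE-L F1 over whitespace tokens (assumed tokenization)."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy example (not taken from the corpus): a normalized output vs. a reference.
score = rouge_l_f1("tôi đi đâu đó", "tôi đi đâu")
```

In the toy example the candidate shares a three-token subsequence with the reference, so precision is 3/4 and recall is 1. A corpus-level figure such as the one reported would typically average sentence-level scores, although the exact aggregation used in the paper is not stated here.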

📝 Abstract
Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs sourced from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Based on this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters. ViDia2Std demonstrates that dialect normalization substantially improves downstream tasks, highlighting the need for dialect-aware resources in building robust Vietnamese NLP systems.
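The abstract does not give the formula for the semantic mapping agreement metric, so the following is only a sketch of one plausible reading: two annotators agree on an item when their standard-Vietnamese mappings are identical or belong to the same synonym group. The function names, the synonym groups, and the toy annotations are all hypothetical and are not taken from ViDia2Std.

```python
from typing import Iterable, Sequence


def build_synonym_index(groups: Iterable[Iterable[str]]) -> dict[str, int]:
    """Map each normalized term to the id of its (hypothetical) synonym group."""
    index: dict[str, int] = {}
    for group_id, group in enumerate(groups):
        for term in group:
            index[term.strip().lower()] = group_id
    return index


def mappings_agree(a: str, b: str, syn_index: dict[str, int]) -> bool:
    """True if two standard mappings are identical or listed as synonyms."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True
    group_a, group_b = syn_index.get(a), syn_index.get(b)
    return group_a is not None and group_a == group_b


def semantic_mapping_agreement(
    ann_a: Sequence[str], ann_b: Sequence[str], syn_index: dict[str, int]
) -> float:
    """Fraction of items on which two annotators' mappings agree."""
    if len(ann_a) != len(ann_b):
        raise ValueError("annotators must label the same items")
    if not ann_a:
        return 0.0
    hits = sum(mappings_agree(a, b, syn_index) for a, b in zip(ann_a, ann_b))
    return hits / len(ann_a)


# Toy data: standard mappings chosen by two annotators for four dialect terms.
synonyms = build_synonym_index([["sao", "thế nào"], ["bố", "cha"]])
annotator_a = ["đâu", "sao", "gì", "bố"]
annotator_b = ["đâu", "thế nào", "gì", "mẹ"]
agreement = semantic_mapping_agreement(annotator_a, annotator_b, synonyms)  # 0.75
```

Under this reading, exact-match agreement would be 2 of 4 items, while counting "sao" and "thế nào" as equivalent raises it to 3 of 4. That illustrates why a synonym-aware criterion can give a fairer picture of annotator consistency than strict string matching.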
Problem

Research questions and friction points this paper is trying to address.

Vietnamese dialects
dialect normalization
low-resource NLP
parallel corpus
dialect-to-standard translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

dialect normalization
parallel corpus
low-resource NLP
Vietnamese dialects
semantic mapping agreement
Authors

Khoa Anh Ta
Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Nguyen Van Dinh
Assoc. Prof. PhD, Vietnam National University of Agriculture.
Soft Computing based on Rough Set and Fuzzy Set Theories.
Kiet Van Nguyen
University of Information Technology, VNU-HCM
Data Science, Artificial Intelligence, Computational Linguistics