🤖 AI Summary
This study addresses the significant phonological differences among the three major Vietnamese dialect regions, which complicate grapheme-to-phoneme mapping in automatic speech recognition and make that mapping strongly dialect-dependent. To tackle this challenge, the authors propose a dialect-aware phonological modeling framework that decomposes each syllable into structured phonemic components, constructs dialect-specific IPA mappings, and employs a joint phonological decoder to model dialectal variants explicitly. The approach requires no external pretraining. Evaluated on the public multi-dialect dataset UIT-ViMD, the method outperforms multiple baselines and matches the performance of wav2vec2-base-vi-250h while using substantially fewer parameters.
📝 Abstract
Vietnamese exhibits substantial dialectal phonetic variation across Northern, Central, and Southern regions, where identical lexical items may be realized with markedly different pronunciations. Such variation poses challenges for automatic speech recognition (ASR) and remains difficult to model computationally due to the complex relationship between Vietnamese orthography and phonology. Existing approaches typically address dialect variability at the word level, assuming dialect-invariant mappings between spelling and pronunciation, which limits their ability to capture systematic phonetic differences. We propose a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels. The framework introduces a phonetic vocabulary that decomposes each syllable into structured phonetic components and maps them to dialect-specific IPA representations, together with a phonetic-structure decoder that jointly predicts these components. Experiments on UIT-ViMD, the only available multi-dialect dataset for Vietnamese, show that the proposed approach outperforms various pretrained baselines and, in particular, matches the performance of the strongest pretrained model, wav2vec2-base-vi-250h, across dialects while using substantially fewer parameters and no external pretraining. Code for reproducing the experiments will be made publicly available upon acceptance of this paper.
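To make the idea of a structured, dialect-specific phonetic vocabulary concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a three-way split into onset, rime, and tone, which the abstract does not specify, and it uses a deliberately tiny onset table covering only a few well-known Northern and Southern contrasts (Central is omitted). The names `decompose`, `onset_ipa`, `ONSET_IPA`, and `TONE_MARKS` are hypothetical, and digraph cases such as "gi" and "qu" are not handled.

```python
import unicodedata

# Combining tone marks used in Vietnamese orthography (unmarked = ngang tone).
TONE_MARKS = {
    "\u0301": "sac",    # acute
    "\u0300": "huyen",  # grave
    "\u0309": "hoi",    # hook above
    "\u0303": "nga",    # tilde
    "\u0323": "nang",   # dot below
}

# Vowel letters after tone removal, normalized to precomposed (NFC) form.
VOWELS = set(unicodedata.normalize("NFC", "aăâeêioôơuưy"))

# Illustrative, heavily simplified onset table (an assumption, not the paper's).
ONSET_IPA = {
    "tr": {"north": "tɕ", "south": "ʈ"},
    "d": {"north": "z", "south": "j"},
    "r": {"north": "z", "south": "r"},
}


def decompose(syllable):
    """Split one syllable into (onset letters, rime letters, tone name)."""
    tone = "ngang"
    kept = []
    # NFD separates tone marks from base letters and vowel-quality marks.
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # Recompose so that letters such as "ơ" or "â" are single characters again.
    base = unicodedata.normalize("NFC", "".join(kept))
    # The onset is the run of non-vowel letters at the start of the syllable.
    i = 0
    while i < len(base) and base[i] not in VOWELS:
        i += 1
    return base[:i], base[i:], tone


def onset_ipa(onset, dialect):
    """Look up a dialect-specific IPA onset; fall back to the spelling itself."""
    return ONSET_IPA.get(onset, {}).get(dialect, onset)


if __name__ == "__main__":
    onset, rime, tone = decompose("trời")
    for dialect in ("north", "south"):
        print(dialect, onset_ipa(onset, dialect), rime, tone)
```

In this sketch the same written syllable yields different onset symbols depending on the dialect key, which is the kind of dialect-dependent spelling-to-sound mapping the abstract argues word-level approaches miss. A fuller version would also need dialect-specific rime and tone inventories.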