🤖 AI Summary
This work proposes a phoneme-level end-to-end automatic speech recognition (ASR) approach for Vietnamese that explicitly incorporates syllabic phonological constraints during decoding to generate valid syllables from a compact phoneme inventory. Unlike conventional ASR systems that rely on character- or subword-level units and require large vocabularies, the proposed method leverages the intricate syllable structure of Vietnamese without needing additional training data or pretrained models. Evaluated on the LSVSC and UIT-ViMD benchmarks, the system outperforms strong baselines such as PhoWhisper and Wav2Vec2, achieving higher recognition accuracy with a substantially reduced vocabulary size. Notably, it demonstrates robust performance across multiple dialects, highlighting its effectiveness in handling linguistic variation inherent in Vietnamese speech.
📝 Abstract
Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the phonetic structure of speech and often require large vocabularies to maintain adequate coverage. Motivated by the phonemic features of Vietnamese, we propose a Syllabic-Structure Decoder for ASR that models speech at the phoneme level rather than the orthographic level. Our approach explicitly captures the phonological composition of syllables, enabling the decoder to generate valid syllabic structures from a compact phonemic inventory. This design aligns more closely with the phonetic realization of speech while significantly reducing vocabulary size. Experiments on two benchmarks, LSVSC (standard speech) and UIT-ViMD (a multi-dialect corpus with diverse regional pronunciations), show that our method consistently outperforms strong baselines, including pretrained models such as PhoWhisper and Wav2Vec2, despite using a substantially smaller vocabulary and no additional training resources. These results highlight the effectiveness of phoneme-based syllabic modeling for Vietnamese ASR. Code to reproduce the experiments will be made publicly available upon acceptance of this paper.
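To make the idea of generating valid syllables from a compact inventory concrete, the following is a minimal illustrative sketch, not the paper's decoder. The toy inventories, the slot layout (onset, rime, tone), and the function names are all assumptions introduced here. The only linguistic rule it encodes is the well-known Vietnamese constraint that syllables closed by a stop coda (-p, -t, -c, -ch) carry only the sắc or nặng tone. It shows how per-slot prediction with a validity mask keeps the output well-formed while the unit vocabulary stays far smaller than the set of syllables it can produce.

```python
# Illustrative sketch only: toy inventories and names are hypothetical,
# not the phoneme set or architecture used in the paper.
ONSETS = ["", "b", "m", "t"]          # "" stands for an empty onset
RIMES = ["a", "an", "at"]             # medial + nucleus + coda, collapsed for brevity
TONES = ["ngang", "huyen", "hoi", "nga", "sac", "nang"]

STOP_CODAS = ("p", "t", "c", "ch")    # codas that close a "checked" syllable
CHECKED_TONES = {"sac", "nang"}       # the only tones a checked syllable may carry


def allowed_tones(rime):
    """Return the tones that are phonologically valid for this rime."""
    if rime.endswith(STOP_CODAS):
        return [t for t in TONES if t in CHECKED_TONES]
    return list(TONES)


def decode_syllable(onset_scores, rime_scores, tone_scores):
    """Pick the best unit per slot, masking tones the rime does not allow."""
    onset = max(ONSETS, key=lambda u: onset_scores[u])
    rime = max(RIMES, key=lambda u: rime_scores[u])
    tone = max(allowed_tones(rime), key=lambda u: tone_scores[u])
    return onset, rime, tone


def unit_vocab_size():
    """Number of output units when each slot is predicted separately."""
    return len(ONSETS) + len(RIMES) + len(TONES)


def valid_syllable_count():
    """Number of distinct valid syllables the slots can compose."""
    return len(ONSETS) * sum(len(allowed_tones(r)) for r in RIMES)


if __name__ == "__main__":
    onset_scores = {"": 0.1, "b": 0.2, "m": 0.6, "t": 0.1}
    rime_scores = {"a": 0.2, "an": 0.1, "at": 0.7}
    # "ngang" scores highest, but it is masked out because "at" ends in a stop.
    tone_scores = {"ngang": 0.5, "huyen": 0.1, "hoi": 0.05,
                   "nga": 0.05, "sac": 0.2, "nang": 0.1}
    print(decode_syllable(onset_scores, rime_scores, tone_scores))
    print(unit_vocab_size(), valid_syllable_count())
```

Even in this toy setting the slot-wise units number far fewer than the syllables they compose, and the gap widens as the inventories grow toward realistic sizes. A real decoder would presumably score slots jointly or autoregressively rather than independently as done here.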