🤖 AI Summary
This paper addresses syllable sequence reconstruction from incomplete input for six abugida-script languages (Bengali, Hindi, Khmer, Lao, Myanmar (Burmese), and Thai) under four missingness patterns: consonant-only sequences, vowel-only sequences, random character deletion, and fixed-syllable masking. We propose a multilingual Transformer-based sequence-to-sequence model jointly trained on the Asian Language Treebank (ALT). To our knowledge, this is the first systematic cross-lingual evaluation of reconstruction performance across diverse abugida writing systems and missingness types. Results show that consonant sequences are the strongest predictive cue, whereas reconstruction from vowel sequences is substantially harder. The model achieves high BLEU scores on consonant-driven tasks and remains robust on partial and masked-syllable reconstruction. These findings offer a practical foundation for text prediction, spelling correction, and data augmentation in abugida languages.
📝 Abstract
This paper explores syllable sequence prediction in abugida languages using Transformer-based models, focusing on six languages from the Asian Language Treebank (ALT) dataset: Bengali, Hindi, Khmer, Lao, Myanmar, and Thai. We investigate the reconstruction of complete syllable sequences from several types of incomplete input: consonant sequences, vowel sequences, partial syllables (with random character deletions), and masked syllables (with fixed syllable deletions). Our experiments show that consonant sequences play a critical role in accurate syllable prediction, achieving high BLEU scores, while vowel sequences present a significantly greater challenge. The model performs robustly across tasks, with its strongest results on inputs that retain consonant information and on partial and masked syllable reconstruction. This study advances the understanding of sequence prediction for abugida languages and provides practical insights for applications such as text prediction, spelling correction, and data augmentation in these scripts.
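To make the four incomplete input types concrete, the following is a minimal sketch of how they might be derived from an already segmented syllable list. It is not the authors' preprocessing, which the abstract does not describe. The function names, the `<mask>` token, the "mask every n-th syllable" rule, and the Devanagari-only consonant test (U+0915 to U+0939) are all assumptions made for illustration; other scripts would need their own character classes.

```python
import random

def is_consonant(ch):
    # Assumption: Devanagari consonant letters occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"

def consonant_sequence(syllables):
    # Keep only the consonant letters of each syllable.
    return ["".join(c for c in s if is_consonant(c)) for s in syllables]

def vowel_sequence(syllables):
    # Keep everything that is not a consonant letter (vowel signs, virama, ...).
    return ["".join(c for c in s if not is_consonant(c)) for s in syllables]

def partial_syllables(syllables, drop_prob, rng):
    # Delete each character independently with probability drop_prob.
    return ["".join(c for c in s if rng.random() >= drop_prob) for s in syllables]

def masked_syllables(syllables, every, mask="<mask>"):
    # Hypothetical fixed pattern: replace every `every`-th syllable with a mask token.
    return [mask if (i + 1) % every == 0 else s for i, s in enumerate(syllables)]

if __name__ == "__main__":
    syllables = ["न", "म", "स्ते"]  # the word "नमस्ते", segmented by hand
    print(consonant_sequence(syllables))
    print(vowel_sequence(syllables))
    print(partial_syllables(syllables, 0.3, random.Random(0)))
    print(masked_syllables(syllables, every=2))
```

In a setup like this, each derived sequence would serve as the model's source side and the original syllable list as the target, so the same sequence-to-sequence model can be trained on all four tasks.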