🤖 AI Summary
This work addresses the automatic diacritization challenge for Yoruba—a low-resource tonal language—by introducing YAD, the first dedicated benchmark dataset for Yoruba diacritization, and Yoruba-T5, the first monolingual T5 model pretrained specifically for Yoruba. Methodologically, we adopt the T5 text-to-text framework, performing monolingual pretraining on Yoruba corpora followed by supervised fine-tuning on YAD, with systematic evaluation of diacritization performance. Our key contributions are: (1) releasing YAD—a high-quality, human-verified diacritization benchmark; (2) open-sourcing the first monolingual T5 model designed explicitly for tonal language diacritization; and (3) empirically demonstrating that both model capacity and training data scale exert significant positive effects on diacritization accuracy. Experiments show that Yoruba-T5 substantially outperforms the multilingual mT5 baseline on YAD, establishing a reproducible, language-specific paradigm for phonological normalization in low-resource tonal languages.
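The text-to-text framing described above treats diacritization as sequence transduction: the model reads undiacritized Yoruba text and emits the fully diacritized form. A minimal sketch of how such (input, target) pairs can be derived from diacritized text is shown below; this is an illustrative assumption about the data-preparation step, not the paper's released pipeline, and uses only Unicode normalization from the standard library.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD) and drop combining marks (tone marks,
    # underdots), yielding the undiacritized input the model must restore.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def make_pair(diacritized: str) -> tuple[str, str]:
    # Text-to-text pair: source is the stripped sentence,
    # target is the diacritized reference the model should generate.
    return strip_diacritics(diacritized), diacritized

source, target = make_pair("Yorùbá")  # → ("Yoruba", "Yorùbá")
```

At fine-tuning time, pairs like these would be fed to a T5-style model as plain source/target strings; the evaluation then compares the model's generated diacritized text against the human-verified reference in YAD.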
📝 Abstract
In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that it outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models yield better diacritization for Yorùbá.