🤖 AI Summary
Uzbek, a low-resource, morphologically rich language, lacks publicly available universal part-of-speech (UPOS) annotation resources and benchmark datasets for POS tagging.
Method: We construct the first open UPOS-annotated benchmark dataset for Uzbek following Universal Dependencies guidelines, fine-tune two monolingual Uzbek BERT models on this data, and systematically evaluate their performance against multilingual BERT and a rule-based tagger.
Contribution/Results: The fine-tuned monolingual Uzbek BERT models achieve an average accuracy of 91%, substantially outperforming all baselines. This work provides the first empirical validation that monolingual pretraining effectively captures suffix-driven POS variation and context-sensitive morphology, capabilities beyond the reach of traditional rule-based systems. It establishes the first publicly available UPOS benchmark for Uzbek, filling a critical gap in Uzbek NLP infrastructure, and offers a reproducible evaluation framework and an effective methodology for POS tagging in low-resource, morphologically complex languages.
📝 Abstract
This paper advances NLP research on the low-resource Uzbek language by evaluating two previously untested monolingual Uzbek BERT models on the part-of-speech (POS) tagging task and introducing the first publicly available UPOS-tagged benchmark dataset for Uzbek. Our fine-tuned models achieve 91% average accuracy, outperforming both the multilingual BERT baseline and a rule-based tagger. Notably, these models capture POS changes induced by affixes and demonstrate context sensitivity, unlike existing rule-based taggers.
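Fine-tuning BERT for POS tagging, as described above, treats the task as token classification over subword pieces. Since Uzbek suffixes often end up as separate subwords, word-level UPOS labels must be aligned to subword tokens before training. A minimal sketch of that alignment step, assuming HuggingFace-style `word_ids` output (the Uzbek example and the UPOS id mapping are illustrative, not taken from the paper):

```python
# Align word-level UPOS labels to subword tokens for BERT-style
# token classification. Special tokens and subword continuations
# get the ignore index -100, so the loss counts each word once.
IGNORE_INDEX = -100

def align_labels(word_ids, word_labels):
    """word_ids: for each subword, the index of its source word
    (None for special tokens), as a HuggingFace fast tokenizer
    would produce. word_labels: one UPOS label id per word."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:                  # [CLS], [SEP], padding
            aligned.append(IGNORE_INDEX)
        elif wid != previous:            # first subword of a word
            aligned.append(word_labels[wid])
        else:                            # subword continuation
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned

# Illustrative example: "Kitoblarni o'qidim" ("I read the books"),
# where suffixes -lar and -ni are split off by the tokenizer.
# UPOS ids (hypothetical mapping): NOUN=0, VERB=1.
word_ids = [None, 0, 0, 0, 1, 1, None]  # [CLS] kitob ##lar ##ni o'qi ##dim [SEP]
print(align_labels(word_ids, [0, 1]))   # [-100, 0, -100, -100, 1, -100, -100]
```

Because only the first subword of each word carries a real label, the model's predictions are read off at those positions at evaluation time, which is the standard setup for UPOS benchmarks.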