🤖 AI Summary
NLP for Central Kurdish is constrained by severe resource scarcity, high linguistic diversity, and limited computational resources, and conventional word embedding models (e.g., Word2Vec) fail to capture the language's semantic and contextual nuances. To address this, we introduce KurBERT—the first pre-trained BERT model specifically designed for Central Kurdish—trained on a large-scale monolingual corpus with a bidirectional Transformer architecture to learn deep contextualized representations. KurBERT achieves significant improvements in sentiment analysis, outperforming traditional methods on standard benchmark datasets and establishing an evaluation baseline for Central Kurdish NLP. Our empirical results demonstrate the efficacy of deep contextual language models for low-resource Kurdish NLP tasks. Furthermore, the study provides a reproducible and scalable technical framework for adapting multilingual BERT to under-resourced languages, advancing both theoretical understanding and practical deployment in linguistically diverse, data-scarce settings.
📝 Abstract
This paper advances sentiment analysis for the Central Kurdish language by applying Bidirectional Encoder Representations from Transformers (BERT). Kurdish is a low-resource language with high linguistic diversity and minimal computational resources, which makes sentiment analysis challenging. Earlier work relied on traditional word embedding models such as Word2Vec, but the emergence of contextual language models, notably BERT, promises substantial improvements. BERT's contextualized representations enable this study to capture the nuanced semantics and contextual intricacies of Central Kurdish, setting a new benchmark for sentiment analysis in low-resource languages.
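The paper's training pipeline is not reproduced here, but the general setup it describes—fine-tuning a BERT-style encoder with a sentiment classification head—can be sketched with the Hugging Face `transformers` library. This is an illustrative sketch only, not the authors' KurBERT implementation: the tiny randomly initialized configuration stands in for a real pretrained checkpoint so the example runs without downloading weights, and the label count (negative/neutral/positive) is an assumption.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny stand-in configuration; a real run would instead load a pretrained
# Central Kurdish checkpoint with from_pretrained(...) and its tokenizer.
config = BertConfig(
    vocab_size=1000,          # placeholder vocabulary size
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=3,             # assumed: negative / neutral / positive
)
model = BertForSequenceClassification(config)

# Dummy batch of 4 sequences, 16 token ids each, with gold labels.
input_ids = torch.randint(0, config.vocab_size, (4, 16))
attention_mask = torch.ones(4, 16, dtype=torch.long)
labels = torch.randint(0, config.num_labels, (4,))

# One fine-tuning step: forward pass computes cross-entropy loss internally.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
out.loss.backward()
optimizer.step()

print(tuple(out.logits.shape))  # one logit per class for each sequence
```

Swapping the toy config for a pretrained multilingual or monolingual checkpoint is the only structural change needed to move from this sketch toward the adaptation framework the summary describes.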