KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Central Kurdish faces severe resource scarcity, high linguistic diversity, and computational constraints, rendering conventional word embedding models (e.g., Word2Vec) inadequate for capturing semantic and contextual nuances. To address this, we introduce KuBERT—the first pre-trained BERT model specifically designed for Central Kurdish—trained on a large-scale monolingual corpus using a bidirectional Transformer architecture to learn deep contextualized representations. KuBERT achieves significant improvements in sentiment analysis, outperforming traditional methods on standard benchmark datasets and establishing the first authoritative evaluation baseline for Central Kurdish NLP. Our empirical results demonstrate the efficacy of deep contextual language models in low-resource Kurdish NLP tasks. Furthermore, the study provides a reproducible and scalable technical framework for adapting multilingual BERT to under-resourced languages, advancing both theoretical understanding and practical deployment in linguistically diverse, data-scarce settings.
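One reason BERT-style models outperform Word2Vec in morphologically rich, low-resource settings is their subword (WordPiece) tokenization, which decomposes out-of-vocabulary words into known pieces instead of discarding them. The sketch below illustrates greedy longest-match-first WordPiece segmentation with a toy English-like vocabulary; the vocabulary and examples are illustrative assumptions, not drawn from the KuBERT paper.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation.

    Continuation pieces carry the conventional '##' prefix.
    Returns [unk] if the word cannot be fully segmented.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no segmentation covers this word
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"play", "##ing", "##ed", "work"}
print(wordpiece_tokenize("playing", toy_vocab))  # → ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))      # → ['[UNK]']
```

Because unseen inflected forms still map to meaningful subword pieces, the model can share representations across morphological variants — a property Word2Vec's whole-word vocabulary lacks.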

📝 Abstract
This paper advances sentiment analysis for the Central Kurdish language by integrating Bidirectional Encoder Representations from Transformers (BERT) into the NLP pipeline. Kurdish is a low-resource language with high linguistic diversity and minimal computational resources, which makes sentiment analysis challenging. Earlier work relied on traditional word embedding models such as Word2Vec, but newer language models, BERT in particular, offer clear room for improvement. BERT's stronger, contextual word embeddings allow this study to capture the nuanced semantics and contextual intricacies of Kurdish, setting a new benchmark for sentiment analysis in low-resource languages.
Problem

Research questions and friction points this paper is trying to address.

Enhancing sentiment analysis for low-resource Central Kurdish language
Addressing linguistic diversity with minimal computational resources
Improving word embeddings beyond traditional models like Word2Vec
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed KuBERT model for Central Kurdish language
Applied BERT embeddings to capture contextual nuances
Established sentiment analysis benchmark for low-resource languages
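A typical way to apply BERT embeddings to sentiment classification is to attach a small classification head to the sentence-level ([CLS]) representation. The pure-Python sketch below shows only that final step — a linear layer followed by softmax over sentiment classes. The weights, embedding dimension, and labels here are hypothetical placeholders; the paper's actual head would be learned by fine-tuning KuBERT.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentiment_head(cls_embedding, weights, bias):
    """Linear layer + softmax on top of a [CLS] embedding.

    weights: one row of coefficients per sentiment class.
    Returns a probability distribution over classes.
    """
    logits = [sum(w * h for w, h in zip(row, cls_embedding)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical 2-d embedding and 2-class (negative/positive) head.
cls_vec = [1.0, 0.0]
W = [[1.0, 0.0],   # class 0 weights
     [0.0, 1.0]]   # class 1 weights
b = [0.0, 0.0]
probs = sentiment_head(cls_vec, W, b)
print(probs)  # class 0 receives the higher probability here
```

In practice the [CLS] vector comes from the fine-tuned encoder (e.g., 768 dimensions for a BERT-base model), and the head's parameters are trained jointly with the encoder on labeled sentiment data.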