🤖 AI Summary
NLP for Central Kurdish is constrained by severe resource scarcity, high linguistic diversity, and limited computational resources, and conventional word embedding models (e.g., Word2Vec) fail to capture the language's semantic and contextual nuances. To address this, we introduce KurBERT—the first pre-trained BERT model specifically designed for Central Kurdish—trained on a large-scale monolingual corpus with a bidirectional Transformer architecture to learn deep contextualized representations. KurBERT achieves significant improvements in sentiment analysis, outperforming traditional methods on standard benchmark datasets and establishing an evaluation baseline for Central Kurdish NLP. Our empirical results demonstrate the efficacy of deep contextual language models for low-resource Kurdish NLP tasks. Furthermore, the study provides a reproducible and scalable technical framework for adapting multilingual BERT to under-resourced languages, advancing both theoretical understanding and practical deployment in linguistically diverse, data-scarce settings.
📝 Abstract
This paper advances sentiment analysis for the Central Kurdish language by applying Bidirectional Encoder Representations from Transformers (BERT). Kurdish is a low-resource language with high linguistic diversity and minimal computational resources, which makes sentiment analysis challenging. Earlier work relied on traditional word embedding models such as Word2Vec, but the emergence of contextual language models, notably BERT, promises substantial improvements. BERT's contextualized representations enable this study to capture the nuanced semantics and contextual intricacies of Central Kurdish, setting a new benchmark for sentiment analysis in low-resource languages.
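The paper's training pipeline is not reproduced here, but the general setup it describes—fine-tuning a BERT-style encoder with a sentiment classification head—can be sketched with the Hugging Face `transformers` library. This is an illustrative sketch only, not the authors' KurBERT implementation: the tiny randomly initialized configuration stands in for a real pretrained checkpoint so the example runs without downloading weights, and the label count (negative/neutral/positive) is an assumption.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny stand-in configuration; a real run would instead load a pretrained
# Central Kurdish checkpoint with from_pretrained(...) and its tokenizer.
config = BertConfig(
    vocab_size=1000,          # placeholder vocabulary size
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=3,             # assumed: negative / neutral / positive
)
model = BertForSequenceClassification(config)

# Dummy batch of 4 sequences, 16 token ids each, with gold labels.
input_ids = torch.randint(0, config.vocab_size, (4, 16))
attention_mask = torch.ones(4, 16, dtype=torch.long)
labels = torch.randint(0, config.num_labels, (4,))

# One fine-tuning step: forward pass computes cross-entropy loss internally.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
out.loss.backward()
optimizer.step()

print(tuple(out.logits.shape))  # one logit per class for each sequence
```

Swapping the toy config for a pretrained multilingual or monolingual checkpoint is the only structural change needed to move from this sketch toward the adaptation framework the summary describes.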