KurdSTS: The Kurdish Semantic Textual Similarity

📅 2025-09-26

🤖 AI Summary

Kurdish, a low-resource language, lacks a semantic textual similarity (STS) benchmark, which hinders evaluation of how well language models capture meaning. To address this gap, the authors introduce KurdSTS, the first Kurdish STS dataset, comprising 10,000 sentence pairs drawn from authentic texts that exhibit formal and informal registers, rich morphological variation, orthographic inconsistencies, and code-mixing. All pairs are annotated with fine-grained similarity scores via double-blind human annotation. The authors establish reproducible baselines using Sentence-BERT, mBERT, and other multilingual encoders; experiments reveal substantial performance degradation compared to high-resource languages, confirming the dataset's difficulty and diagnostic value. KurdSTS fills a gap in Kurdish semantic understanding and provides a methodological blueprint and foundational resource for STS research on morphologically complex low-resource languages.

📝 Abstract

Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
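The benchmarking setup described above can be illustrated with a minimal sketch of a typical STS evaluation loop: score each sentence pair by the cosine similarity of its two embeddings, then correlate the predicted scores with the human annotations. The abstract does not state which metric or score scale the authors use; Spearman rank correlation and a 0 to 5 scale are assumptions here, chosen because they are common in STS work. The toy vectors stand in for encoder outputs (for example from Sentence-BERT), and all function and variable names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy embeddings standing in for encoder outputs (hypothetical data).
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.1, 0.0]),  # near-identical meaning
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # partial overlap
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # unrelated
]
gold_scores = [5.0, 3.0, 0.0]  # assumed 0-5 human similarity scale

predicted = [cosine_similarity(u, v) for u, v in pairs]
rho = spearman(predicted, gold_scores)
print(f"Spearman correlation: {rho:.3f}")
```

Rank correlation is a reasonable choice in this setting because cosine similarities and human scores live on different scales; only the ordering of the pairs needs to agree.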

Problem

Research questions and friction points this paper is trying to address.

Creating the first Kurdish semantic textual similarity dataset

Addressing challenges from Kurdish morphology and orthography

Establishing benchmarks for low-resource Kurdish NLP tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

First Kurdish STS dataset, with 10,000 annotated sentence pairs

Benchmarked Sentence-BERT and multilingual BERT models

Highlighted challenges arising from Kurdish morphology, orthographic variation, and code-mixing
