🤖 AI Summary
Kurdish, a low-resource language, lacks a semantic textual similarity (STS) benchmark, which hinders evaluation of how well language models capture meaning. To address this gap, we introduce KurdiSTS, the first Kurdish STS dataset, comprising 10,000 sentence pairs drawn from authentic texts that cover formal and informal registers, rich morphological variation, orthographic inconsistencies, and code-mixing. All pairs are annotated with fine-grained similarity scores through double-blind human annotation. We establish reproducible baselines using Sentence-BERT, mBERT, and other multilingual encoders; experiments reveal substantial performance degradation compared to high-resource languages, confirming the dataset's difficulty and diagnostic value. KurdiSTS provides both a foundational resource for Kurdish semantic understanding and a methodological blueprint for STS research on morphologically complex low-resource languages.
📝 Abstract
Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
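As a rough illustration of how an STS baseline of this kind is commonly scored, the sketch below compares cosine similarities between sentence embeddings with human similarity ratings, using Pearson and Spearman correlation. The abstract does not state which metrics or rating scale the authors use, so the correlation metrics, the 0 to 5 gold scale, and the toy two-dimensional vectors (stand-ins for real encoder outputs) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def evaluate(embedding_pairs, gold_scores):
    """Score each pair by cosine similarity and correlate with gold ratings."""
    predicted = [cosine(u, v) for u, v in embedding_pairs]
    return {
        "pearson": pearson(predicted, gold_scores),
        "spearman": spearman(predicted, gold_scores),
    }


# Hypothetical toy data: each tuple holds the embeddings of one sentence pair.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [1.0, 1.0]),  # partially overlapping
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
toy_gold = [5.0, 3.0, 0.0]  # assumed 0-5 human similarity ratings

scores = evaluate(toy_pairs, toy_gold)
print(scores)
```

In practice the toy vectors would be replaced by embeddings from an encoder such as Sentence-BERT, and the gold list by the dataset's annotated scores; the scoring logic stays the same.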