SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of large-scale evaluation benchmarks for keyphrase extraction in morphologically rich yet resource-poor languages such as Slovak. To bridge this gap, we introduce SlovKE, a new dataset comprising 227,432 scholarly abstracts—marking the first Slovak benchmark approaching the scale of mainstream English datasets. We systematically evaluate unsupervised methods including YAKE, TextRank, and KeyBERT (paired with SlovakBERT), alongside KeyLLM, a GPT-3.5-turbo–based approach. Experimental results reveal that traditional methods suffer from morphological mismatch, achieving a maximum F1@6 of only 11.6%. In contrast, KeyLLM substantially narrows the gap between exact and partial matching metrics, and human evaluation (κ = 0.61) confirms its ability to generate keyphrases more aligned with author-provided annotations, thereby mitigating the underestimation inherent in automatic evaluation metrics.

📝 Abstract
Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases, scraped and systematically cleaned from the Slovak Central Register of Theses, representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match F1@6, with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact-partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents (κ = 0.61) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods, a finding relevant to other inflected languages. The dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and evaluation code (https://github.com/NaiveNeuron/SlovKE) are publicly available.
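To make the exact- vs partial-match gap in the abstract concrete, the sketch below computes F1@k under both matching regimes. The partial-match criterion here (case-insensitive substring overlap) is a simple illustrative heuristic, not necessarily the paper's own definition; the function name and the Slovak phrases are hypothetical examples.

```python
def f1_at_k(predicted, gold, k=6, partial=False):
    """F1 between the top-k predicted keyphrases and the gold (author) set.

    partial=False: a prediction counts only if it equals a gold phrase
                   exactly (after lowercasing), which penalizes inflected
                   surface forms in languages like Slovak.
    partial=True:  a prediction counts if it is a substring of some gold
                   phrase, or vice versa (one simple notion of overlap).
    """
    preds = [p.lower().strip() for p in predicted[:k]]
    golds = {g.lower().strip() for g in gold}
    if not preds or not golds:
        return 0.0
    if partial:
        hits = sum(1 for p in preds if any(p in g or g in p for g in golds))
    else:
        hits = sum(1 for p in preds if p in golds)
    precision = hits / len(preds)
    recall = hits / len(golds)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# An inflection-style miss: "učenie" is part of the gold phrase
# "strojové učenie", so it scores 0 under exact match but 1 under
# this partial-overlap heuristic.
print(f1_at_k(["učenie"], ["strojové učenie"]))                # 0.0
print(f1_at_k(["učenie"], ["strojové učenie"], partial=True))  # 1.0
```

This mirrors the pattern the paper reports: exact matching heavily underestimates methods whose output differs from the gold keyphrase only in surface form.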
Problem

Research questions and friction points this paper is trying to address.

keyphrase extraction
low-resource languages
morphologically rich languages
Slovak
evaluation dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

keyphrase extraction
low-resource languages
morphologically rich languages
large language models
dataset construction