Semantic change detection for Slovene language: a novel dataset and an approach based on optimal transport

📅 2024-02-26

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Semantic change detection remains underexplored for low-resource Slavic languages, particularly Slovene (about two million speakers), because no dedicated evaluation benchmark exists. Method: We introduce the first curated benchmark for Slovene semantic change detection, comprising 104 target words and over 3,000 human-annotated sentence pairs, and propose an end-to-end framework grounded in optimal transport theory. This framework jointly models dynamic word embedding alignment, context-aware vector aggregation, and cross-temporal semantic distribution comparison, circumventing limitations of heuristic distance metrics and static pre-trained embeddings. Contribution/Results: On our benchmark, the method achieves an error reduction rate of 22.8% relative to the prior state of the art. This work establishes the first evaluation resource for semantic change in Slovene, accompanied by a fully reproducible methodology and open-source toolchain, thereby advancing computational language evolution research into linguistically under-resourced settings.

📝 Abstract

In this paper, we focus on the detection of semantic changes in Slovene, a less-resourced Slavic language with two million speakers. Detecting and tracking semantic changes provides insights into the evolution of the language caused by changes in society and culture. Several systems have recently been proposed to aid in this study, but all depend on manually annotated gold standard datasets for evaluation. We present the first Slovene dataset for evaluating semantic change detection systems, which contains aggregated semantic change scores for 104 target words obtained from more than 3,000 manually annotated sentence pairs. We evaluate several existing semantic change detection methods on this dataset and also propose a novel approach based on optimal transport that improves on the existing state-of-the-art systems with an error reduction rate of 22.8%.
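The abstract reports an error reduction rate but does not define the underlying error measure. As a minimal sketch, assuming the rate is the usual relative reduction between a baseline error and a new error, the arithmetic could look like the following. The function name and the two error values are hypothetical and chosen only so that the result matches the reported 22.8%; they are not figures from the paper.

```python
def error_reduction_rate(baseline_error, new_error):
    """Relative error reduction: (baseline - new) / baseline.

    Assumption of this sketch: "error" is any non-negative quantity where
    lower is better. The paper's exact definition is not given in the abstract.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical errors, picked only to illustrate a 22.8% reduction.
rate = error_reduction_rate(baseline_error=0.500, new_error=0.386)
print(f"error reduction rate: {rate:.1%}")
```

A relative rate of this kind depends on the size of the baseline error, so the same absolute improvement yields a larger rate when the baseline is already strong.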

Problem

Research questions and friction points this paper is trying to address.

Detecting semantic changes in Slovene, an under-resourced language

Evaluating limitations of current semantic change metrics

Proposing a robust optimal transport-based metric for semantic change

Innovation

Methods, ideas, or system contributions that make the work stand out.

First Slovene dataset for semantic change detection

Novel metric using regularized optimal transport

Improved performance over baseline approaches
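To make the idea of a regularized optimal transport metric concrete, here is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between two sets of usage vectors for one word, one set per time period. It is not the authors' implementation: the cosine cost, the uniform weights, the `epsilon` and `n_iters` values, the function names, and the toy vectors are all assumptions made for illustration. It uses only the Python standard library.

```python
import math


def cosine_distance(x, y):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def sinkhorn_cost(old_vecs, new_vecs, epsilon=0.1, n_iters=200):
    """Transport cost under an entropy-regularized OT plan.

    Assumptions of this sketch: every usage vector carries equal weight,
    and the ground cost is cosine distance.
    """
    n, m = len(old_vecs), len(new_vecs)
    a = [1.0 / n] * n  # uniform mass on usages from the older period
    b = [1.0 / m] * m  # uniform mass on usages from the newer period
    cost = [[cosine_distance(x, y) for y in new_vecs] for x in old_vecs]
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]

    # Sinkhorn iterations: alternately rescale rows and columns so the
    # plan's marginals match a and b.
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]

    # Transport plan P[i][j] = u[i] * K[i][j] * v[j]; return sum of P * cost.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


# Toy 2-D "embeddings" (hypothetical): a word whose usages stay put,
# and a word whose usages move to a different region.
stable_old = [[1.0, 0.0], [0.0, 1.0]]
stable_new = [[1.0, 0.0], [0.0, 1.0]]
shifted_old = [[1.0, 0.0], [1.0, 0.1]]
shifted_new = [[0.0, 1.0], [0.1, 1.0]]

print("stable word :", sinkhorn_cost(stable_old, stable_new))
print("shifted word:", sinkhorn_cost(shifted_old, shifted_new))
```

In this sketch a higher cost means the usage distributions of the two periods are further apart, which is read as a stronger signal of semantic change. A smaller `epsilon` moves the plan closer to unregularized optimal transport but needs more iterations and can underflow in the kernel.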
