🤖 AI Summary
This work addresses automatic terminology extraction for Slovenian, a low-resource language. We propose an SVM-based classification method that integrates contextual word embeddings (BERT-like), statistical features (frequency, token length), and linguistic features (part-of-speech tags, dependency relations). Crucially, this is the first study to incorporate contextual embeddings into Slovenian terminology identification. Instead of relying on a predefined list of part-of-speech patterns, our approach derives term candidate selection rules from an analysis of the newly constructed, manually annotated corpus RSDO5, and adds contextual representations as features. We evaluate the method on the four domains of RSDO5 (biomechanics, linguistics, chemistry, and veterinary medicine) and achieve significant improvements in F1 score over the previous state of the art for Slovenian. The results indicate that contextual embeddings substantially improve terminology extraction performance. The proposed framework may transfer to terminology identification in other resource-scarce languages.
📝 Abstract
Automated terminology extraction is the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction that combines features from traditional term extraction systems with new contextual features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse RSDO5, a new term-annotated corpus for the Slovenian language, and devise a set of rules for term candidate selection. We then generate statistical, linguistic and context-based features for each candidate. We train a classification model with a support-vector machine algorithm, evaluate it on the four domains of the RSDO5 corpus (biomechanics, linguistics, chemistry, veterinary), and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in F1 score over the previous state of the art, which shows that contextual word embeddings are valuable for improving term extraction.
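The candidate selection step can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not list the actual rules or features, so the toy sentences, the gold terms, and the helper names (`learn_patterns`, `extract_candidates`, `features`) are all hypothetical. The sketch assumes that rules take the form of part-of-speech sequences observed on annotated terms, and it builds only two simple statistical features per candidate.

```python
from collections import Counter

# Toy POS-tagged sentences as (token, tag) pairs. Invented data, not from RSDO5.
SENTENCES = [
    [("the", "DET"), ("neural", "ADJ"), ("network", "NOUN"), ("learns", "VERB")],
    [("a", "DET"), ("neural", "ADJ"), ("network", "NOUN"),
     ("needs", "VERB"), ("data", "NOUN")],
]

# Hypothetical gold annotations: each term is a tuple of tokens.
GOLD_TERMS = {("neural", "network"), ("data",)}


def learn_patterns(sentences, gold_terms, min_count=1):
    """Collect the POS sequences that annotated terms take in the corpus."""
    counts = Counter()
    for sent in sentences:
        tokens = [tok for tok, _ in sent]
        tags = [tag for _, tag in sent]
        for term in gold_terms:
            n = len(term)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == term:
                    counts[tuple(tags[i:i + n])] += 1
    return {pattern for pattern, c in counts.items() if c >= min_count}


def extract_candidates(sentences, patterns):
    """Count every n-gram whose POS sequence matches a learned pattern."""
    freq = Counter()
    lengths = {len(p) for p in patterns}
    for sent in sentences:
        tokens = [tok for tok, _ in sent]
        tags = [tag for _, tag in sent]
        for n in lengths:
            for i in range(len(tokens) - n + 1):
                if tuple(tags[i:i + n]) in patterns:
                    freq[tuple(tokens[i:i + n])] += 1
    return freq


def features(candidate, freq):
    """Statistical features only: frequency, length in tokens, length in characters."""
    return [freq[candidate], len(candidate), sum(len(tok) for tok in candidate)]


patterns = learn_patterns(SENTENCES, GOLD_TERMS)
freq = extract_candidates(SENTENCES, patterns)
vector = features(("neural", "network"), freq)
```

In a fuller pipeline, each vector would be extended with linguistic features and a contextual embedding of the candidate, then passed to a support-vector classifier such as scikit-learn's `sklearn.svm.SVC`. Note that the learned patterns also admit non-terms such as `("network",)`, which is why a trained classifier is needed after candidate selection.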