Extracting domain-specific terms using contextual word embeddings

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses automatic terminology extraction for Slovenian, a low-resource language. It proposes an SVM-based classification method that combines contextual word embeddings (BERT-like) with statistical features (frequency, token length) and linguistic features (part-of-speech tags, dependency relations). The work is presented as the first to incorporate contextual embeddings into Slovenian term extraction. Instead of relying on a predefined list of part-of-speech patterns, the authors analyse the newly constructed, manually annotated corpus RSDO5 and derive candidate selection rules from it. The method is evaluated on the four RSDO5 domains (biomechanics, linguistics, chemistry, and veterinary medicine) and achieves significant F1-score improvements over the previous state of the art for Slovenian. The results indicate that contextual embeddings are valuable for term extraction, and the approach may transfer to other low-resource languages.

📝 Abstract
Automated terminology extraction refers to the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction, which combines features from traditional term extraction systems with novel contextual features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse a new term-annotated corpus for the Slovenian language, RSDO5, and devise a set of rules for term candidate selection. We then generate statistical, linguistic and context-based features for each candidate. We use a support-vector machine algorithm to train a classification model, evaluate it on the four domains (biomechanics, linguistics, chemistry, veterinary) of the RSDO5 corpus and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in terms of F1 score over the previous state of the art, which indicates that contextual word embeddings are valuable for improving term extraction.
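
The feature combination described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the helper names (`mean_vector`, `candidate_features`), the `POS_PATTERNS` list, the choice of frequency and token count as statistical features, and the toy 2-dimensional vectors standing in for BERT-like embeddings. It is not the authors' feature set or code.

```python
from typing import Dict, List, Sequence

# Hypothetical part-of-speech patterns, used only for this illustration.
POS_PATTERNS = ["NOUN", "ADJ NOUN", "NOUN NOUN"]


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token-level vectors into one candidate-level vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def candidate_features(
    tokens: List[str],
    pos_tags: List[str],
    token_vectors: Sequence[Sequence[float]],
    corpus_counts: Dict[str, int],
) -> List[float]:
    """Concatenate statistical, linguistic and context-based features."""
    phrase = " ".join(tokens)
    # Statistical features (assumed): corpus frequency and length in tokens.
    statistical = [float(corpus_counts.get(phrase, 0)), float(len(tokens))]
    # Linguistic features (assumed): one-hot encoding of the POS pattern.
    pattern = " ".join(pos_tags)
    linguistic = [1.0 if pattern == p else 0.0 for p in POS_PATTERNS]
    # Context-based features: mean of the candidate's token embeddings.
    contextual = mean_vector(token_vectors)
    return statistical + linguistic + contextual


features = candidate_features(
    tokens=["contextual", "embedding"],
    pos_tags=["ADJ", "NOUN"],
    token_vectors=[[0.25, 0.5], [0.75, 0.0]],  # toy stand-ins for real embeddings
    corpus_counts={"contextual embedding": 3},
)
print(features)  # [3.0, 2.0, 0.0, 1.0, 0.0, 0.5, 0.25]
```

In a real system the token vectors would come from a contextual language model and the resulting feature vector would be passed to the classifier.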
Problem

Research questions and friction points this paper is trying to address.

Extract domain-specific terms using machine learning
Combine traditional and contextual embedding features
Improve term extraction accuracy for the Slovenian language
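
Since the reported improvements are stated in terms of F1 score, a short sketch of how precision, recall and F1 could be computed over sets of extracted terms may help. It assumes set-level matching of term strings, which the page does not confirm as the paper's evaluation protocol, and the term lists are invented toy data, not taken from RSDO5.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Compare extracted terms with gold-standard terms by exact string match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy data for illustration only.
gold_terms = {"sklep", "kost", "tetiva", "koleno"}
predicted_terms = {"sklep", "kost", "hoja", "tek"}

p, r, f1 = precision_recall_f1(predicted_terms, gold_terms)
print(p, r, f1)  # 0.5 0.5 0.5
```

Running the same function once per domain would give one F1 score for each of the four domains.
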
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextual word embeddings integration
Support-vector machine classification model
Multi-domain term extraction improvement
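
To make the classification step concrete, here is a minimal sketch of a linear support-vector machine trained by batch subgradient descent on the regularised hinge loss. This is a swapped-in stand-in chosen so that the example needs no external libraries: the page does not state the paper's kernel, solver or hyperparameters. The function names, the default values and the 2-dimensional toy feature vectors are all assumptions.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],  # labels: +1 = term, -1 = non-term
    reg: float = 0.01,
    lr: float = 0.1,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Minimise reg/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 (term) or -1 (non-term) from the sign of the decision value."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0.0 else -1


# Toy, linearly separable candidate feature vectors (invented for illustration).
X_train = [[2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0]]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
print(predict(w, b, [3.0, 3.0]), predict(w, b, [-3.0, -1.0]))  # 1 -1
```

In practice a library implementation with tuned hyperparameters would replace this hand-written trainer, and the inputs would be the combined statistical, linguistic and contextual features of each term candidate.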