🤖 AI Summary
This study addresses lemmatization, variety identification, and language classification for Romansh, which comprises five main varieties plus the supra-regional standard Rumantsch Grischun, with a single lemmatizer, RUMLEM, that covers all six. Building on community-driven morphological databases, one per variety, the authors develop a variety-aware, lexicon-based lemmatizer. Because each variety has its own database, the same tool also supports variety identification and Romansh vs. non-Romansh classification. The lemmatizer covers 77–84% of the words in a typical Romansh text and identifies the correct variety in 95% of cases on 30,000 texts of varying lengths; a proof of concept indicates that Romansh vs. non-Romansh classification is feasible.
📝 Abstract
Lemmatization -- the task of mapping an inflected word form to its dictionary form -- is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30,000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
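The idea of reusing per-variety lexicons for both lemmatization and classification can be illustrated with a minimal sketch. This is not RUMLEM's actual implementation or API: the abstract does not say how the variety is selected, so the sketch assumes the variety whose lexicon covers the largest share of tokens wins. The names `TOY_LEXICONS`, `lemmatize`, `coverage`, `identify_variety`, and `is_romansh`, the toy entries, and the 0.5 threshold are all hypothetical.

```python
import re

# Hypothetical toy lexicons (variety -> {inflected form -> lemma}).
# The entries are illustrative only and are not taken from the real databases.
TOY_LEXICONS = {
    "sursilvan": {"casa": "casa", "casas": "casa", "jeu": "jeu"},
    "vallader": {"chasa": "chasa", "chasas": "chasa", "eu": "eu"},
}


def tokenize(text):
    """Lowercase the text and keep runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lemmatize(tokens, lexicon):
    """Look up each token; unknown forms map to None."""
    return [lexicon.get(token) for token in tokens]


def coverage(tokens, lexicon):
    """Share of tokens that have an entry in the lexicon."""
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def identify_variety(text, lexicons=TOY_LEXICONS):
    """Assumed heuristic: pick the variety with the highest coverage."""
    tokens = tokenize(text)
    scores = {name: coverage(tokens, lex) for name, lex in lexicons.items()}
    best = max(scores, key=scores.get)
    return best, scores


def is_romansh(text, lexicons=TOY_LEXICONS, threshold=0.5):
    """Assumed heuristic: Romansh if any lexicon covers enough tokens.

    The threshold is an arbitrary placeholder, not a value from the paper.
    """
    _, scores = identify_variety(text, lexicons)
    return max(scores.values()) >= threshold
```

With these toy entries, `identify_variety("Jeu casas")` selects `"sursilvan"`, because only that lexicon contains both tokens. A real system would need a tie-breaking rule and a threshold tuned on held-out data.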