AI Summary
This study addresses the challenge of distinguishing between the regional dialects of Romansh and its supra-regional standard variety, Rumantsch Grischun, a task that existing language identification systems struggle to perform accurately. The work proposes a support vector machine (SVM)-based approach to Romansh dialect identification and introduces a new benchmark dataset covering both dialectal and Rumantsch Grischun texts across two textual domains. The system reaches an average accuracy of 97% in in-domain evaluation, enabling downstream applications such as dialect-aware spell checking and machine translation.
Abstract
The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Because Romansh LID should also recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, the task is a novel and interesting classification problem. In this paper, we present an LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
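To make the idea of an SVM-based idiom classifier concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify features or training details, so everything here is an assumption: character n-gram features, a one-vs-rest linear SVM trained by plain subgradient descent on the hinge loss, and the hypothetical helper names `char_ngrams`, `train_one_vs_rest`, and `predict`. The training strings are synthetic placeholders, not real Romansh, and the labels `idiom_a` and `idiom_b` stand in for actual idioms.

```python
import random
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Map a text to an L2-normalised bag of character n-grams (assumed features)."""
    padded = f" {text.lower()} "  # mark word boundaries at both ends
    counts = Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {gram: v / norm for gram, v in counts.items()}


def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(gram, 0.0) * v for gram, v in x.items())


def train_one_vs_rest(texts, labels, epochs=50, eta=0.1, lam=0.01, seed=0):
    """Train one linear SVM per label with hinge-loss subgradient descent."""
    rng = random.Random(seed)
    vectors = [char_ngrams(t) for t in texts]
    weights = {lab: {} for lab in sorted(set(labels))}
    order = list(range(len(texts)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            for lab, w in weights.items():
                y = 1.0 if labels[i] == lab else -1.0
                margin = y * dot(w, vectors[i])
                # L2 regularisation: shrink every weight slightly.
                for gram in w:
                    w[gram] *= 1.0 - eta * lam
                # Hinge loss: update only if the example violates the margin.
                if margin < 1.0:
                    for gram, v in vectors[i].items():
                        w[gram] = w.get(gram, 0.0) + eta * y * v
    return weights


def predict(weights, text):
    """Return the label whose classifier gives the highest score."""
    x = char_ngrams(text)
    return max(weights, key=lambda lab: dot(weights[lab], x))


# Synthetic placeholder data (NOT real Romansh); labels are hypothetical.
TEXTS = [
    "aba baba abab", "baab abba ba",
    "kiki ikik kiik", "ikki kiki ik",
    "mumu umum muum", "ummu mumu um",
]
LABELS = [
    "idiom_a", "idiom_a",
    "idiom_b", "idiom_b",
    "rumantsch_grischun", "rumantsch_grischun",
]

weights = train_one_vs_rest(TEXTS, LABELS)
print(predict(weights, "abba baba"))
```

In this toy setup the three classes share no n-grams, so the task is far easier than real idiom identification, where Rumantsch Grischun overlaps with several idioms. A production system would more likely rely on an established SVM library than on a hand-written training loop.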