🤖 AI Summary
Problem: Romansh is a low-resource language with five regional varieties (idioms). No parallel corpus across these idioms has been available, which hinders NLP development.
Method: We construct the first parallel corpus covering all five Romansh idioms, derived from 291 schoolbook volumes whose content is comparable across idioms. Automatic alignment yields 207k multi-parallel segments with more than 2 million tokens in total.
Contribution/Results: The dataset is released in both parallel and unaligned versions under a CC-BY-NC-SA license. A small-scale human evaluation confirms that the segments are highly parallel, and training and evaluating an LLM on a sample of the dataset demonstrates its utility for machine translation between Romansh idioms. The corpus fills a data gap for Romansh and supports research and applications in multilingual and low-resource NLP.
📝 Abstract
The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.