The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks

📅 2025-08-22

🤖 AI Summary

Problem: Romansh is a low-resource language with five idioms (varieties), and no parallel corpus across these idioms has been available, which hinders NLP development. Method: We construct the first parallel corpus covering all five idioms, derived from 291 schoolbook volumes that are comparable in content. Automatic alignment yields 207k multi-parallel segments with more than 2M tokens in total. Contribution/Results: The dataset is released in parallel and unaligned versions under a CC-BY-NC-SA license. A small-scale human evaluation confirms that the segments are highly parallel, and training and evaluating an LLM on a sample of the dataset demonstrates its utility for machine translation between Romansh idioms.

📝 Abstract

The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.
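
To make the notion of a multi-parallel segment concrete, the following is a minimal sketch of how one such segment could be expanded into directed translation pairs for machine translation between idioms. The dictionary layout, the lowercase idiom keys, the function name `translation_pairs`, and the placeholder texts are assumptions for illustration only; they are not the released dataset's schema. The sketch also assumes that a segment may lack a version in some idioms.

```python
from itertools import permutations

# The five Romansh idioms. The lowercase keys are hypothetical labels,
# not necessarily the identifiers used in the released dataset.
IDIOMS = ("sursilvan", "sutsilvan", "surmiran", "puter", "vallader")


def translation_pairs(segment):
    """Yield (src_idiom, tgt_idiom, src_text, tgt_text) for every ordered
    pair of idioms that has non-empty text in the given segment."""
    present = [idiom for idiom in IDIOMS if segment.get(idiom)]
    for src, tgt in permutations(present, 2):
        yield src, tgt, segment[src], segment[tgt]


# A hypothetical multi-parallel segment with placeholder text.
# Here one idiom is missing, so only four idioms contribute pairs.
segment = {
    "sursilvan": "<segment text in Sursilvan>",
    "surmiran": "<segment text in Surmiran>",
    "puter": "<segment text in Puter>",
    "vallader": "<segment text in Vallader>",
}

pairs = list(translation_pairs(segment))
print(len(pairs))  # number of directed pairs derived from this segment
```

A segment available in all five idioms yields 5 × 4 = 20 directed pairs under this scheme, which is one reason a multi-parallel corpus is a compact source of training data for translation between every pair of idioms.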

Problem

Research questions and friction points this paper is trying to address.

Creating the first parallel corpus for the five Romansh idioms

Extracting aligned segments from comparable schoolbook content

Enabling machine translation between Romansh idioms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic alignment of comparable schoolbooks

Extracting multi-parallel segments from texts

Training and evaluating an LLM for machine translation between Romansh idioms
