The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks

📅 2025-08-22
🤖 AI Summary
Romansh is a low-resource language with five largely standardized idioms (varieties), and until now no parallel corpus of these idioms has existed. Method: The authors build the first parallel corpus covering all five idioms from 291 schoolbook volumes that are comparable in content. Automatic alignment yields 207k multi-parallel segments with more than 2M tokens in total. Contribution/Results: The parallel and unaligned versions of the dataset are released under a CC-BY-NC-SA license. A small-scale human evaluation confirms that the segments are highly parallel, and training and evaluating an LLM on a sample of the dataset demonstrates its utility for machine translation between Romansh idioms.

📝 Abstract
The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.
Problem

Research questions and friction points this paper is trying to address.

Creating the first parallel corpus for the five Romansh idioms
Extracting aligned segments from comparable schoolbook content
Enabling machine translation between Romansh idioms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic alignment of comparable schoolbooks
Extracting multi-parallel segments from texts
Training and evaluating an LLM for machine translation between Romansh idioms
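
To make the idea of multi-parallel segments concrete, here is a minimal Python sketch of one way pairwise alignments could be merged into rows that span several idioms. It is an illustration only, not the authors' pipeline: the pivot-based merging strategy, the function name `build_multiparallel`, the `min_idioms` threshold, and the placeholder segment strings are all assumptions introduced here.

```python
from typing import Dict, List, Optional

# The five Romansh idioms; using lowercase names as keys is an assumption of this sketch.
IDIOMS = ["sursilvan", "sutsilvan", "surmiran", "puter", "vallader"]


def build_multiparallel(
    segments: Dict[str, List[str]],
    pivot_links: Dict[str, Dict[int, int]],
    pivot: str = "sursilvan",
    min_idioms: int = 2,
) -> List[Dict[str, Optional[str]]]:
    """Merge pivot-to-idiom alignments into multi-parallel rows.

    segments:    idiom -> list of segments from one book (hypothetical input).
    pivot_links: idiom -> {pivot segment index: segment index in that idiom}.
    A row is kept only if at least `min_idioms` idioms have a segment.
    """
    rows = []
    for i, pivot_text in enumerate(segments[pivot]):
        # Start with an empty slot for every idiom, then fill what is aligned.
        row: Dict[str, Optional[str]] = {idiom: None for idiom in IDIOMS}
        row[pivot] = pivot_text
        for idiom, links in pivot_links.items():
            j = links.get(i)
            if j is not None:
                row[idiom] = segments[idiom][j]
        if sum(text is not None for text in row.values()) >= min_idioms:
            rows.append(row)
    return rows


# Toy data with placeholder strings instead of real schoolbook text.
segments = {
    "sursilvan": ["a0", "a1", "a2"],
    "vallader": ["b0", "b1"],
    "puter": ["c0"],
}
pivot_links = {
    "vallader": {0: 0, 2: 1},  # pivot segment 0 <-> vallader 0, pivot 2 <-> vallader 1
    "puter": {0: 0},           # pivot segment 0 <-> puter 0
}
rows = build_multiparallel(segments, pivot_links)
```

In this toy run, the pivot segment with no counterpart in any other idiom is dropped, while partially covered rows keep `None` for the missing idioms. A real corpus would need an actual sentence aligner to produce the pairwise links.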
Authors
Zachary Hopton (University of Zurich)
Jannis Vamvas (University of Zurich)
Andrin Büchler (University of Teacher Education of the Grisons)
Anna Rutkiewicz (University of Zurich)
Rico Cathomas (University of Teacher Education of the Grisons)
Rico Sennrich (University of Zurich)