Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification

📅 2026-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of large-scale, high-quality sentence-aligned corpora for text simplification in languages other than English. The authors collect crowd-sourced simplification data from comparable corpora in Catalan, English, French, Italian, and Spanish, then derive sentence-level alignments from the document-level data to produce aligned sentence pairs. The resulting dataset is publicly available and is intended for both training and testing multilingual text simplification systems.
📝 Abstract
Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on the collection and processing of crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems across multiple languages (Catalan, English, French, Italian, and Spanish). We report mechanisms for sentence-level alignment from document-level data. The resulting dataset of aligned sentence pairs is publicly available.
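The abstract does not say which alignment mechanism is used, so the following is only a minimal sketch of what sentence-level alignment from document-level data can look like. It assumes a simple greedy strategy: each simplified sentence is paired with the original sentence that has the highest token-overlap (Jaccard) score, and pairs below a threshold are dropped. The function names, the threshold of 0.3, and the example sentences are all hypothetical and are not taken from the paper.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two sentences (0.0 to 1.0)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align_sentences(complex_sents, simple_sents, threshold=0.3):
    """Greedy alignment: for each simplified sentence, keep the most similar
    original sentence if the score reaches the (assumed) threshold."""
    pairs = []
    for simple in simple_sents:
        best = max(complex_sents, key=lambda c: jaccard(c, simple), default=None)
        if best is None:
            continue
        score = jaccard(best, simple)
        if score >= threshold:
            pairs.append((best, simple, score))
    return pairs


# Hypothetical comparable documents, already split into sentences.
complex_doc = [
    "The committee postponed the decision due to insufficient evidence.",
    "Heavy rainfall caused severe flooding in the coastal region.",
]
simple_doc = [
    "Heavy rain caused flooding in the coastal region.",
    "The weather will be nice tomorrow.",
]

pairs = align_sentences(complex_doc, simple_doc)
for original, simplified, score in pairs:
    print(f"{score:.2f}\t{original}\t{simplified}")
```

In this toy input only the flooding sentences are paired; the unrelated sentence about the weather falls below the threshold and is discarded. A real pipeline would likely replace the lexical overlap score with something more robust across languages, such as multilingual sentence embeddings, and would also need to handle sentence splits and merges.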
Problem

Research questions and friction points this paper is trying to address.

text simplification
multilingual corpora
sentence alignment
dataset scarcity
non-English languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

sentence-level alignment
multilingual text simplification
comparable corpora
crowd-sourced simplification data
high-quality aligned corpus