🤖 AI Summary
This work addresses the scarcity of large-scale, high-quality sentence-aligned corpora for text simplification in languages other than English by constructing and publicly releasing a multilingual simplification corpus covering Catalan, English, French, Italian, and Spanish. The authors collect crowd-sourced simplifications from comparable corpora and apply a sentence-level alignment step to the document-level data to produce aligned sentence pairs. The resulting resource helps fill the gap in non-English simplification data and supports both training and evaluating multilingual text simplification systems.
📝 Abstract
Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on collecting and processing crowd-sourced simplification data from comparable corpora, with the goal of constructing a corpus suitable for both training and testing text simplification systems in five languages: Catalan, English, French, Italian, and Spanish. We describe mechanisms for deriving sentence-level alignments from document-level data. The resulting dataset of aligned sentence pairs is publicly available.
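The abstract does not specify how sentence-level alignments are derived, so the following is only a minimal sketch of one common baseline, not the authors' method: each original sentence is greedily paired with the simplified sentence that shares the most tokens with it, measured by Jaccard overlap. The function names, the example sentences, and the `threshold` value are all assumptions chosen for illustration.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens (Unicode-aware)."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def align_sentences(complex_sents, simple_sents, threshold=0.25):
    """Greedily pair each original sentence with its best-matching simplification.

    A pair is kept only if its overlap reaches `threshold` (an illustrative
    value, not one taken from the paper). Returns (original, simplified, score)
    tuples in the order of the original sentences.
    """
    simple_tokens = [tokenize(s) for s in simple_sents]
    pairs = []
    for original in complex_sents:
        original_tokens = tokenize(original)
        best_score, best_idx = 0.0, None
        for i, candidate_tokens in enumerate(simple_tokens):
            score = jaccard(original_tokens, candidate_tokens)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is not None and best_score >= threshold:
            pairs.append((original, simple_sents[best_idx], best_score))
    return pairs


# Hypothetical document pair: an original text and its simplification.
complex_doc = [
    "The committee postponed the decision owing to insufficient evidence.",
    "Precipitation is anticipated throughout the weekend.",
]
simple_doc = [
    "Rain is expected this weekend.",
    "The committee delayed the decision because there was not enough evidence.",
]

for original, simplified, score in align_sentences(complex_doc, simple_doc):
    print(f"{score:.2f}\t{original}\t{simplified}")
```

Pure lexical overlap tends to miss heavily paraphrased simplifications, which is one reason alignment pipelines often combine it with embedding-based similarity or positional constraints; lowering `threshold` recovers more pairs at the cost of more noise.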