SMOL: Professionally translated parallel data for 115 under-represented languages

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior to this work, no high-quality public parallel corpora existed for many of these 115 low-resource languages (LRLs), severely hindering machine translation (MT) development and evaluation. Method: We introduce SMOL (Set of Maximal Overall Leverage), a professionally translated, factuality-annotated parallel corpus for these LRLs totaling 6.1M translated tokens, released at two granularities: sentence-level SMOL-Sent, with sentences chosen for broad unique-token coverage, and document-level SMOL-Doc, with documents chosen for broad topic coverage. The pipeline combines coverage-driven data selection, professional human translation, and human factuality assessment with rationales. Contribution/Results: Using SMOL to prompt or fine-tune Large Language Models yields robust ChrF improvements. Together with the previously released GATITOS, SMOL completes a trifecta of token-, sentence-, and document-level resources, and the factuality ratings on SMOL-Doc constitute the first factuality datasets for most of these languages.

📝 Abstract
We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock translation for low-resource languages (LRLs). SMOL has been translated into 115 under-resourced languages, including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOL-Sent, a set of sentences chosen for broad unique token coverage, and SMOL-Doc, a document-level source focusing on a broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust ChrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOL-Doc, yielding the first factuality datasets for most of these languages.
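The improvements above are reported in ChrF, a character n-gram F-score widely used for low-resource MT because it does not depend on word tokenization. As a rough illustration only (the standard implementation is sacrebleu's CHRF; this simplified sketch strips whitespace and averages n = 1..6 with β = 2, following the metric's usual definition, and is not the paper's evaluation code):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF score in [0, 100]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall twice as heavily as precision
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A perfect match scores 100, and disjoint strings score 0; real evaluation should use sacrebleu, which also handles corpus-level aggregation and word n-gram extensions (chrF++).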
Problem

Research questions and friction points this paper is trying to address.

Unlocking translation for low-resource languages that lack public parallel data
Absence of factuality datasets for most of these languages
Adapting Large Language Models to LRL translation via prompting and fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source, professionally translated parallel data for 115 languages (6.1M tokens)
Two complementary sub-datasets: SMOL-Sent (broad token coverage) and SMOL-Doc (broad topic coverage)
Factuality ratings and rationales for all SMOL-Doc documents, the first for most of these languages