Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

📅 2024-03-20

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Pretraining language models for languages other than English is limited by the scarcity of monolingual data. This paper investigates whether translationese, synthetic text produced by machine translation, can serve as pretraining data. Working with English and Indic languages, the authors translate clean web-crawled monolingual documents into the target language and train 28M- and 85M-parameter language models on the result. On downstream tasks, these models perform only 3.56% worse on NLU and 1.51% worse on NLG than models pretrained on clean data. The authors further propose filtering the synthetic data with lightweight TinyLMs pretrained on clean data, which significantly improves performance, and find that extended pretraining on a small fraction (10%) of clean data gives a strong additional benefit. They release IndicMonoDoc, described as the largest collection of monolingual document-level corpora.

📝 Abstract

In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding and generative tasks is only 3.56% poorer on NLU tasks and 1.51% on NLG tasks than LMs pre-trained on clean data. Further, we propose the use of lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pretraining on a tiny fraction (10%) of clean data. We release the data we collected and created as a part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.
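The extended-pretraining step mentioned in the abstract can be pictured as a two-stage data plan: train on the synthetic corpus first, then continue on a small sample of clean documents. The sketch below is only an illustration under assumptions. The function name `build_two_stage_plan`, the random sampling, and the reading of "10%" as a fraction of the clean corpus are hypothetical choices, not details confirmed by the paper.

```python
import random


def build_two_stage_plan(synthetic_docs, clean_docs, clean_fraction=0.10, seed=0):
    """Return (stage1, stage2) document lists for a two-stage pretraining run.

    stage1: all synthetic (translationese) documents.
    stage2: a random sample of the clean documents (assumed reading of "10%").
    """
    rng = random.Random(seed)
    # Keep at least one clean document, but never more than are available.
    n_clean = min(len(clean_docs), max(1, round(len(clean_docs) * clean_fraction)))
    stage2 = rng.sample(list(clean_docs), n_clean)
    return list(synthetic_docs), stage2


# Toy corpora standing in for real data.
synthetic = [f"translated doc {i}" for i in range(50)]
clean = [f"clean doc {i}" for i in range(20)]

stage1, stage2 = build_two_stage_plan(synthetic, clean)
print(len(stage1), len(stage2))
```

Fixing the seed keeps the clean sample reproducible, which matters when comparing runs that differ only in the synthetic stage.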

Problem

Research questions and friction points this paper is trying to address.

Using translationese for pre-training low-resource language models

Evaluating synthetic data impact on NLU and NLG tasks

Bridging pre-training gaps between English and low-resource languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using translationese as synthetic data for pretraining

Filtering translated documents with tiny LMs

Extended pretraining on a small fraction (10%) of clean data after synthetic-data pretraining
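One way to read the TinyLM filtering idea is as a scoring pass: a small model trained on clean text scores each synthetic document, and poorly scoring documents are dropped. The sketch below is a minimal illustration, not the authors' implementation. It assumes perplexity as the score, uses a toy add-one smoothed unigram model (`UnigramLM`) as a stand-in for a real TinyLM, and picks an arbitrary threshold; all of these names and choices are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy stand-in for a TinyLM: add-one smoothed unigram model over whitespace tokens."""

    def __init__(self, clean_texts):
        self.counts = Counter(tok for text in clean_texts for tok in text.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens

    def perplexity(self, text):
        tokens = text.split()
        if not tokens:
            return float("inf")  # treat empty documents as unusable
        log_prob = sum(
            math.log((self.counts[tok] + 1) / (self.total + self.vocab_size))
            for tok in tokens
        )
        return math.exp(-log_prob / len(tokens))


def filter_by_perplexity(lm, docs, max_perplexity):
    """Keep only documents whose perplexity under `lm` is at or below the threshold."""
    return [doc for doc in docs if lm.perplexity(doc) <= max_perplexity]


# Toy data: the model is fit on clean text, then applied to synthetic candidates.
clean = ["the cat sat on the mat", "the dog sat on the rug"]
synthetic = ["the cat sat on the rug", "zxq vbn qwe rty"]

lm = UnigramLM(clean)
kept = filter_by_perplexity(lm, synthetic, max_perplexity=10.0)
print(kept)
```

In practice the scoring model would be a small neural LM and the threshold would be tuned on held-out data, but the control flow (fit on clean text, score synthetic text, keep what passes) stays the same.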
