One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether romanization can serve as a unified text representation for universal multilingual pretraining, with particular attention to its impact on performance and information loss in high-resource languages. We pretrain encoder models from scratch on six typologically diverse high-resource languages, using original scripts and two levels of romanization fidelity, and systematically evaluate script-induced information loss and cross-lingual interference effects. Our work provides the first comprehensive validation of romanization’s universality in high-resource settings, revealing that languages with segmental scripts (e.g., English, German) retain nearly full performance after romanization while gaining encoding efficiency, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer significant degradation that high-fidelity romanization mitigates only partially. We further demonstrate that increased subword overlap does not induce negative interference, underscoring script type as the key determinant of romanization efficacy.
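The summary contrasts two fidelity levels of romanization for Chinese. A minimal illustrative sketch of what that difference looks like, assuming `unidecode` as the lossy romanizer and tone-numbered pinyin via `pypinyin` as the higher-fidelity one (the paper's actual romanizers are not named in this summary):

```python
# Illustrative sketch: two romanization fidelity levels for Chinese text.
# `unidecode` and `pypinyin` are stand-ins, not the paper's romanizers.
from unidecode import unidecode          # lossy transliteration, drops tones
from pypinyin import lazy_pinyin, Style  # pinyin with numeric tone marks

text = "北京欢迎你"  # "Beijing welcomes you"

low_fidelity = unidecode(text)
# approx. "Bei Jing Huan Ying Ni " -- tone distinctions are lost

high_fidelity = " ".join(lazy_pinyin(text, style=Style.TONE3))
# "bei3 jing1 huan1 ying2 ni3" -- tone numbers preserve the contrast

print(low_fidelity)
print(high_fidelity)
```

The tone numbers are exactly the kind of script-specific information whose loss the study measures for morphosyllabic scripts.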

📝 Abstract
By exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, has focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) transfer between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, whether the information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information and (ii) negative cross-lingual interference from increased vocabulary overlap. Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but does not fully eliminate. Importantly, comparing monolingual LMs with their mLM counterparts, we find no evidence that increased subword overlap induces negative interference. We further show that romanization improves encoding efficiency (i.e., fertility) for segmental scripts at a negligible performance cost.
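The abstract's encoding-efficiency metric, fertility, is the average number of subword tokens produced per whitespace word. A minimal sketch of the computation, assuming a stock HuggingFace tokenizer as a stand-in for the tokenizers trained in the paper:

```python
# Minimal fertility sketch (subword tokens per whitespace word).
# The checkpoint and sentences are illustrative, not the paper's setup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(sentences):
    """Average number of subword tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

original  = ["Die Universität liegt im Stadtzentrum."]
romanized = ["Die Universitaet liegt im Stadtzentrum."]  # hand-romanized

print(f"original:  {fertility(original):.2f}")
print(f"romanized: {fertility(romanized):.2f}")
```

Lower fertility means fewer subwords per word, i.e., a more efficient encoding; this is the gain the paper reports for segmental scripts.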
Problem

Research questions and friction points this paper addresses. A sketch of how the subword-overlap question can be quantified follows the tag list below.

romanization
cross-lingual transfer
multilingual language models
script-induced information loss
lexical overlap
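One friction point above is whether the extra lexical overlap that romanization exposes causes negative cross-lingual interference. A hedged sketch of how subword-vocabulary overlap between two monolingual tokenizers could be quantified (the checkpoints are illustrative stand-ins, not the models pretrained in the paper):

```python
# Sketch: Jaccard overlap of two tokenizers' subword vocabularies.
# Checkpoints are illustrative stand-ins for the paper's monolingual models.
from transformers import AutoTokenizer

tok_en = AutoTokenizer.from_pretrained("bert-base-cased")         # English
tok_de = AutoTokenizer.from_pretrained("bert-base-german-cased")  # German

vocab_en = set(tok_en.get_vocab())
vocab_de = set(tok_de.get_vocab())

jaccard = len(vocab_en & vocab_de) / len(vocab_en | vocab_de)
print(f"Jaccard subword overlap: {jaccard:.3f}")
```

Romanizing both corpora before training the tokenizers would raise this overlap; the paper's finding is that the increase does not translate into negative interference.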
Innovation

Methods, ideas, or system contributions that make the work stand out.

romanization
multilingual language models
cross-lingual transfer
script normalization
encoding efficiency