🤖 AI Summary
Existing multilingual pre-trained language models (mPLMs) often omit explicit language embeddings, forcing token representations to encode all language-specific information, which may compromise language neutrality. To address this, the authors propose LangSAMP, a continual pre-training method that integrates learnable language and script embeddings into the output of the Transformer blocks of XLM-R, before the language modeling head. These embeddings capture language- and script-specific information, which helps the token representations become more language-neutral, and they can also be used to select better source languages for zero-shot cross-lingual transfer. After continual pre-training on a corpus spanning more than 500 languages, the model consistently outperforms the baseline across diverse downstream tasks in zero-shot settings. Improved pairwise cosine similarity indicates that the learned representations are more language-neutral. The code and models are publicly released.
📝 Abstract
Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings, i.e., learnable vectors assigned to individual languages. However, this places a significant burden on token representations to encode all language-specific information, which may hinder language neutrality. To address this limitation, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning. Specifically, we integrate these embeddings into the output of the Transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline in zero-shot crosslingual transfer across diverse downstream tasks. Extensive analysis reveals that language and script embeddings capture language- and script-specific nuances, which in turn yields more language-neutral representations, as evidenced by improved pairwise cosine similarity. In a case study, we also show that language and script embeddings can be used to select better source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
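The integration step described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes that "integrating" the embeddings means adding one language vector and one script vector to every token's final hidden state before the language modeling head; the abstract does not state the exact operation. All names here (`add_lang_script`, `lm_head`, the language and script keys, the toy dimensions) are hypothetical and are not taken from the released code, and random values stand in for learned parameters.

```python
import random


def make_table(rows, dim, rng):
    """Return a rows x dim table of small random floats (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def add_lang_script(hidden, lang_vec, script_vec):
    """Add the same language and script vectors to every token's final hidden state.

    Assumption: integration is plain element-wise addition.
    """
    return [[h + l + s for h, l, s in zip(token, lang_vec, script_vec)]
            for token in hidden]


def lm_head(states, vocab_weights):
    """Project each state onto the vocabulary: logits[t][v] = dot(states[t], vocab_weights[v])."""
    return [[sum(x * w for x, w in zip(state, row)) for row in vocab_weights]
            for state in states]


rng = random.Random(0)
dim, vocab_size, seq_len = 8, 20, 5

# Hypothetical ID maps; the real model covers more than 500 languages.
lang_ids = {"eng": 0, "rus": 1}
script_ids = {"Latn": 0, "Cyrl": 1}

lang_table = make_table(len(lang_ids), dim, rng)      # one vector per language
script_table = make_table(len(script_ids), dim, rng)  # one vector per script
vocab_weights = make_table(vocab_size, dim, rng)      # toy language modeling head

# Stand-in for the output of the last Transformer block for one Russian sentence.
hidden = make_table(seq_len, dim, rng)

adjusted = add_lang_script(
    hidden,
    lang_table[lang_ids["rus"]],
    script_table[script_ids["Cyrl"]],
)
logits = lm_head(adjusted, vocab_weights)  # seq_len x vocab_size scores
```

In this sketch the language and script vectors only touch the input to the prediction head, so `hidden` itself is computed without any language or script ID. That is one way to read the abstract's claim that the auxiliary embeddings relieve token representations of language-specific information.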