Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment

📅 2026-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the weak cross-lingual alignment of existing multilingual pretrained models, which stems from the absence of explicit alignment signals during pretraining. To overcome this, the authors construct a multi-way parallel corpus covering six languages, generated by translating English text with an off-the-shelf neural machine translation model, and fine-tune models such as XLM-R, mBERT, and mE5 via contrastive learning. This substantially improves cross-lingual representation quality: compared to conventional English-centric (En-X) bilingual data, it yields significant gains on the MTEB benchmark across bitext mining (+21.3%), semantic similarity (+5.3%), and classification (+28.4%). Notably, the improvements extend to unseen languages, underscoring the distinctive value of multi-way parallel data for cross-lingual alignment.

📝 Abstract
Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-RoBERTa and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, fine-tuning the mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to one without, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
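The core recipe described above, contrastive alignment over a multi-way parallel batch where every translation of a sentence is a positive and other sentences in the batch are negatives, can be sketched as an in-batch InfoNCE-style objective. The tensor shapes, temperature value, and function name below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def multiway_contrastive_loss(embeddings: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE-style loss over a batch of multi-way parallel sentences.

    embeddings: shape (num_languages, batch_size, dim), where embeddings[l, i]
    is the encoder's embedding of sentence i rendered in language l. For every
    ordered language pair (a, b), the translation of sentence i in language b
    is the positive for its counterpart in language a; the other batch_size - 1
    sentences act as in-batch negatives.
    """
    num_langs, batch, _ = embeddings.shape
    # L2-normalize so dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    total, pairs = 0.0, 0
    for a in range(num_langs):
        for b in range(num_langs):
            if a == b:
                continue
            logits = z[a] @ z[b].T / temperature          # (batch, batch)
            logits -= logits.max(axis=1, keepdims=True)   # numerical stability
            log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
            # Cross-entropy with the diagonal (the true translation) as target.
            total += -log_probs[np.arange(batch), np.arange(batch)].mean()
            pairs += 1
    return total / pairs
```

Averaging over all ordered language pairs is what distinguishes this from English-centric (En-X) training, which would only use pairs anchored at English; here every language is aligned directly against every other.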
Problem

Research questions and friction points this paper is trying to address.

multilingual embeddings
cross-lingual alignment
multi-way parallel corpus
representation space
NLU tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-way parallel corpus
contrastive learning
cross-lingual alignment
multilingual embeddings
bitext mining