🤖 AI Summary
This work identifies a systematic ranking bias in multilingual pretrained language models (multiPLMs)—including LASER3, XLM-R, and LaBSE—when scoring parallel sentence pairs for low-resource languages, leading to noisy sentence pairs being erroneously ranked at the top and degrading neural machine translation (NMT) performance. To address this, we propose a lightweight, interpretable heuristic debiasing method for corpus cleaning: it integrates multilingual sentence embeddings for similarity-based ranking while incorporating controllable heuristics—such as sentence length, entropy, and language identification—for filtering. Evaluated on CCMatrix and CCAligned benchmarks, our approach consistently improves NMT performance across low-resource directions (e.g., En→Si/Ta, Si→Ta), yielding an average BLEU gain of +1.8. Moreover, it reduces performance variance across different multiPLM choices by 67%, marking the first systematic investigation and mitigation of inherent ranking bias in multiPLM-based parallel corpus mining.
📝 Abstract
Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training NMT systems with the top-ranked samples, produces superior NMT performance compared to training on the full dataset. However, previous research has also shown that the choice of multiPLM significantly impacts ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En→Si, En→Ta and Si→Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that a series of heuristics can remove this noise to a certain extent, improving the performance of NMT systems trained on web-mined corpora and reducing the disparity across multiPLMs.