Linguistic Entity Masking to Improve Cross-Lingual Representation of Multilingual Language Models for Low-Resource Languages

📅 2025-01-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Multilingual pretrained language models represent low-resource languages poorly in cross-lingual settings. Method: The paper proposes Linguistic Entity Masking (LEM), a targeted masking strategy for continual pretraining. Instead of the random masking used in MLM and TLM, LEM masks only nouns, verbs, and named entities, and masks a single token within each entity span, so that more of the surrounding context is preserved. Entities are identified using part-of-speech tagging and named entity recognition. Contribution/Results: Evaluated on three low-resource language pairs (English–Sinhala, English–Tamil, and Sinhala–Tamil), a model continually pretrained with LEM outperforms the MLM+TLM baseline on bitext mining, parallel data curation, and code-mixed sentiment analysis. The results support linguistically informed masking as a way to improve cross-lingual representations in low-resource settings.

📝 Abstract

Multilingual Pre-trained Language Models (multiPLMs), trained on the Masked Language Modelling (MLM) objective, are commonly used for cross-lingual tasks such as bitext mining. However, the performance of these models is still suboptimal for low-resource languages (LRLs). To improve the language representation of a given multiPLM, it is possible to further pre-train it. This is known as continual pre-training. Previous research has shown that continual pre-training with MLM and subsequently with Translation Language Modelling (TLM) improves the cross-lingual representation of multiPLMs. However, during masking, both MLM and TLM give equal weight to all tokens in the input sequence, irrespective of the linguistic properties of the tokens. In this paper, we introduce a novel masking strategy, Linguistic Entity Masking (LEM), to be used in the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. LEM differs from MLM and TLM in two ways. First, it limits masking to the linguistic entity types nouns, verbs and named entities, which hold a higher prominence in a sentence. Second, it limits masking to a single token within the linguistic entity span, thus keeping more context, whereas in MLM and TLM tokens are masked randomly. We evaluate the effectiveness of LEM on three downstream tasks, namely bitext mining, parallel data curation and code-mixed sentiment analysis, using three low-resource language pairs: English-Sinhala, English-Tamil, and Sinhala-Tamil. Experimental results show that continually pre-training a multiPLM with LEM outperforms a multiPLM continually pre-trained with MLM+TLM for all three tasks.
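
The masking rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that entity spans (nouns, verbs, named entities) have already been produced by some external POS tagger and NER model, and the names `lem_mask`, `Span`, `MASKABLE_TYPES`, the `mask_prob` parameter, and the `[MASK]` placeholder are all hypothetical choices made for illustration.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # assumed placeholder; the real symbol depends on the tokenizer

# Hypothetical span format: (start, end_exclusive, entity_type)
Span = Tuple[int, int, str]

# Entity types that the abstract names as eligible for masking
MASKABLE_TYPES = {"NOUN", "VERB", "NE"}


def lem_mask(
    tokens: List[str],
    spans: List[Span],
    mask_prob: float = 1.0,  # assumed knob; the abstract does not state a rate
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask at most one token inside each noun, verb, or named-entity span.

    Returns the masked tokens and a parallel label list that holds the
    original token at masked positions and None everywhere else.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for start, end, kind in spans:
        if kind not in MASKABLE_TYPES or start >= end:
            continue  # skip other entity types and empty spans
        if rng.random() >= mask_prob:
            continue  # leave this span unmasked
        pos = rng.randrange(start, end)  # a single token within the span
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


tokens = ["Sri", "Lanka", "exports", "tea", "to", "many", "countries"]
spans = [(0, 2, "NE"), (2, 3, "VERB"), (3, 4, "NOUN"), (6, 7, "NOUN")]
masked, labels = lem_mask(tokens, spans, rng=random.Random(0))
print(masked)
```

In this example only one of "Sri" and "Lanka" is masked, so the other half of the named entity stays visible as context, and function words such as "to" and "many" are never masked. For a TLM-style input, the same function could in principle be applied to the concatenation of a sentence and its translation, with spans supplied for both sides.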

Problem

Research questions and friction points this paper is trying to address.

Multilingual Pretraining

Resource-poor Languages

Performance Improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Linguistic Entity Masking

Multilingual Pretraining

Cross-lingual Information Retrieval

🔎 Similar Papers

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis