Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant performance degradation of large language models (LLMs) on low-resource languages (LRLs), a consequence of English-centric pretraining, this paper proposes a lightweight cross-lingual transfer method that requires no multilingual or parallel data. The approach introduces a multi-layer fusion architecture that dynamically injects all intermediate-layer representations from an mT5 multilingual encoder into the LLM; a learnable token-level weighting mechanism, with Global Softmax and Transformer Softmax variants, to adaptively model layer importance; and an embedding-space mapping that aligns semantic representations across languages. Crucially, the method is trained exclusively on English data. Empirical results show substantial improvements in LRL understanding: average accuracy on XNLI reaches 71.50%; Sinhala classification accuracy rises by 4.2 percentage points (from 71.66% to 75.86%); and Indo-Aryan languages consistently benefit, significantly outperforming the LangBridge baseline.
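The Global Softmax variant described above can be sketched as a single learnable weight per encoder layer, followed by a projection into the LLM's embedding space. This is an illustrative sketch only: the class and dimension names are hypothetical, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class GlobalSoftmaxFusion(nn.Module):
    """Sketch of global layer fusion: one learnable weight per encoder
    layer, shared across all tokens, then a map into the LLM space.
    Names and dimensions are illustrative assumptions."""

    def __init__(self, num_layers: int, enc_dim: int, llm_dim: int):
        super().__init__()
        # One scalar logit per encoder layer; softmax turns these
        # into a distribution over layers.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        # Linear map from the encoder space into the LLM embedding space.
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, enc_dim),
        # i.e. all intermediate layers stacked along dim 0.
        weights = torch.softmax(self.layer_logits, dim=0)  # (num_layers,)
        # Weighted sum over the layer axis.
        fused = torch.einsum("l,lbsd->bsd", weights, hidden_states)
        return self.proj(fused)  # (batch, seq_len, llm_dim)
```

Because the weights pass through a softmax, the fused representation is always a convex combination of layers, which keeps its scale comparable to any single layer's output.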

📝 Abstract
Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM's embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
Problem

Research questions and friction points this paper is trying to address.

Improving LLM performance for low-resource languages using multilingual encoders
Enhancing linguistic information by fusing all intermediate encoder layers
Enabling multilingual input processing without parallel training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses all intermediate multilingual encoder layers
Uses Global Softmax and Transformer Softmax weighting
Maps multilingual inputs into the LLM's embedding space
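The Transformer Softmax idea listed above can be sketched as a small Transformer that reads the token states and predicts a separate weight distribution over encoder layers for each token. Again a hedged sketch under assumed shapes, not the authors' code; in particular, reading only the final layer to predict the weights is an illustrative choice.

```python
import torch
import torch.nn as nn


class TokenLevelLayerWeighting(nn.Module):
    """Sketch of token-specific layer weighting: a one-block Transformer
    produces per-token logits over encoder layers, and each token's
    fused state is its own softmax-weighted sum of all layers."""

    def __init__(self, num_layers: int, enc_dim: int, nhead: int = 2):
        super().__init__()
        block = nn.TransformerEncoderLayer(
            d_model=enc_dim, nhead=nhead, batch_first=True
        )
        # num_layers=1 here is the depth of the weighting network,
        # not the number of encoder layers being fused.
        self.mixer = nn.TransformerEncoder(block, num_layers=1)
        self.to_logits = nn.Linear(enc_dim, num_layers)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, enc_dim)
        last = hidden_states[-1]  # final encoder layer, (batch, seq_len, enc_dim)
        logits = self.to_logits(self.mixer(last))  # (batch, seq_len, num_layers)
        w = torch.softmax(logits, dim=-1)
        # Per-token weighted sum across the layer axis.
        return torch.einsum("bsl,lbsd->bsd", w, hidden_states)
```

Compared with the global variant, this lets, say, a rare word lean on lower (more lexical) encoder layers while function words lean on higher ones.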