On Multilingual Encoder Language Model Compression for Low-Resource Languages

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of compressing multilingual encoders for low-resource languages—where severe model shrinkage often leads to substantial loss of language knowledge—this paper proposes the first monolingual distillation framework tailored for extreme compression (up to 92%). Methodologically, it integrates two-stage knowledge distillation, structured pruning, Transformer layer truncation, and dynamic vocabulary reduction, enabling coordinated compression across depth, feed-forward network dimensionality, and embedding size. It further quantifies, for the first time, a power-law relationship between teacher-side language data volume and downstream performance degradation. Evaluated on three low-resource languages across four downstream tasks—sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging—the compressed models incur an average accuracy drop of only 2–10%, substantially outperforming baselines. Ablation studies validate the contribution of each component, establishing best practices for efficient compression in low-resource settings.
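To see how reductions in depth, feed-forward size, and embedding size combine into a ~90% compression rate, the sketch below estimates encoder parameter counts for a hypothetical mBERT-like teacher and an extremely compressed student. The configuration sizes are illustrative assumptions, not the paper's exact settings:

```python
def encoder_params(vocab: int, hidden: int, layers: int, ffn: int) -> int:
    """Rough parameter count for a BERT-style encoder: token embeddings
    plus, per layer, the four attention projections and the two
    feed-forward matrices (biases and LayerNorms omitted for simplicity)."""
    embeddings = vocab * hidden
    attention = 4 * hidden * hidden   # Q, K, V, and output projections
    feed_forward = 2 * hidden * ffn   # up- and down-projection
    return embeddings + layers * (attention + feed_forward)

# Hypothetical teacher vs. a student compressed along all three axes:
# fewer layers, smaller hidden/embedding size, smaller FFN, trimmed vocab.
teacher = encoder_params(vocab=120_000, hidden=768, layers=12, ffn=3072)
student = encoder_params(vocab=30_000, hidden=384, layers=3, ffn=1536)

compression_rate = 1 - student / teacher
print(f"{compression_rate:.1%}")  # roughly 90% under these assumed sizes
```

The point of the sketch is that no single axis alone reaches extreme compression; the embedding table dominates the teacher's parameter budget, which is why vocabulary and embedding-size reduction matter as much as layer truncation.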

📝 Abstract
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.
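The vocabulary trimming mentioned in the abstract can be sketched as keeping only the embedding rows for tokens that actually occur in the target language's corpus. This is a simplified static version (the paper describes the combined pipeline, not this exact routine), and the token inventory and matrix below are toy assumptions:

```python
import numpy as np

def trim_vocabulary(embeddings, vocab, corpus_tokens):
    """Keep only embedding rows for tokens seen in the target corpus,
    returning the smaller matrix and a densely re-indexed vocabulary."""
    kept = [tok for tok in vocab if tok in corpus_tokens]
    new_vocab = {tok: i for i, tok in enumerate(kept)}
    rows = [vocab[tok] for tok in kept]          # original row indices
    return embeddings[rows], new_vocab           # integer-array indexing

# Toy multilingual vocabulary with 4-dimensional embeddings.
vocab = {"[PAD]": 0, "hund": 1, "dog": 2, "perro": 3, "katze": 4}
embeddings = np.arange(20, dtype=float).reshape(5, 4)

# A monolingual corpus touches only a subset of the shared vocabulary.
corpus_tokens = {"[PAD]", "dog"}
trimmed, new_vocab = trim_vocabulary(embeddings, vocab, corpus_tokens)
print(trimmed.shape)  # (2, 4): 60% of the embedding rows removed
```

Because the embedding table is typically the largest single component of a multilingual encoder, dropping rows for unused tokens yields large savings without touching the Transformer layers at all.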
Problem

Research questions and friction points this paper is trying to address.

Extreme compression of multilingual encoder models for low-resource languages
Retaining language-specific knowledge while reducing model size
Minimizing performance drop in downstream tasks after compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-step knowledge distillation for compression
Structured pruning and vocabulary trimming
Extreme layer and embedding size reduction
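The distillation objective behind the first bullet can be sketched as the standard temperature-scaled soft-target loss (KL divergence between softened teacher and student distributions). The temperature and logit values are illustrative, and the two-step schedule itself is not shown:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as is conventional so gradients keep their magnitude across T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

teacher_logits = [3.0, 1.0, 0.2]
aligned = kd_loss(teacher_logits, [3.0, 1.0, 0.2])  # zero: student matches
drifted = kd_loss(teacher_logits, [0.2, 1.0, 3.0])  # positive: they differ
print(aligned, drifted)
```

A two-step setup typically distills first into an intermediate-size student and then from that student into the final tiny model, applying a loss of this shape at each step.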