🤖 AI Summary
To address the challenge of compressing multilingual encoders for low-resource languages, where aggressive model shrinkage often causes substantial loss of language-specific knowledge, this paper proposes a monolingual distillation framework for extreme compression (up to 92%). Methodologically, it combines two-step knowledge distillation, structured pruning, Transformer layer truncation, and vocabulary trimming, enabling coordinated reduction of layer depth, feed-forward hidden size, and embedding size. It further quantifies the relationship between the amount of language-specific data seen by the teacher and downstream performance degradation: larger datasets yield smaller losses. Evaluated on three low-resource languages across four downstream tasks (sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging), the compressed models incur only a 2–10% performance drop, substantially outperforming baselines. Ablation studies validate the contribution of each component and establish best practices for efficient compression in low-resource settings.
📝 Abstract
In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming to extremely compress multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2–10% on four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data used to train the teacher model: larger datasets yield smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.
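The abstract does not include implementation details, but the soft-target objective at the heart of knowledge distillation pipelines like this one can be illustrated with a minimal pure-Python sketch (the temperature value and function names below are illustrative assumptions, not taken from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature yields a
    softer distribution over classes (max-subtraction for stability)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened output distribution
    and the student's, scaled by T^2 as is standard in distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q)) * temperature ** 2

# A student that mimics the teacher's distribution incurs a lower loss
# than one whose predictions disagree with the teacher.
teacher = [3.0, 1.0, 0.2]
good_student = [2.9, 1.1, 0.1]
bad_student = [0.1, 1.0, 3.0]
assert distillation_loss(teacher, good_student) < distillation_loss(teacher, bad_student)
```

In a two-step setup, a loss of this form would be applied first during general-purpose distillation and again during task- or language-specific distillation; the sketch shows only the per-example objective, not the training loop.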