🤖 AI Summary
Addressing the challenge of deploying multilingual large language models (LLMs) for low-resource African languages on edge devices, this paper proposes a hybrid compression method integrating knowledge distillation with lightweight attention matching. We design an attention-matrix consistency distillation mechanism tailored to low-resource multilingual settings and introduce a minimal student architecture. Compression is performed under dual constraints, semantic fidelity and attention distribution alignment, enabling efficient model reduction. Using AfroXLMR-Large as the teacher model, our student model achieves over 85% of the teacher’s accuracy across five African languages, including Kinyarwanda and Swahili, while reducing parameter count by more than 85%. This substantially lowers computational and memory overhead, significantly enhancing feasibility for edge deployment.
📝 Abstract
Language model compression through knowledge distillation has emerged as a promising approach for deploying large language models in resource-constrained environments. However, existing methods often struggle to maintain performance when distilling multilingual models, especially for low-resource languages. In this paper, we present a novel hybrid distillation approach that combines traditional knowledge distillation with a simplified attention matching mechanism, specifically designed for multilingual contexts. Our method introduces an extremely compact student model architecture, significantly smaller than conventional multilingual models. We evaluate our approach on five African languages: Kinyarwanda, Swahili, Hausa, Igbo, and Yoruba. The distilled student model, AfroXLMR-Comet, successfully captures both the output distribution and internal attention patterns of a larger teacher model (AfroXLMR-Large) while reducing the model size by over 85%. Experimental results demonstrate that our hybrid approach achieves competitive performance compared to the teacher model, retaining over 85% of the original model's accuracy while requiring substantially fewer computational resources. Our work provides a practical framework for deploying efficient multilingual models in resource-constrained environments, particularly benefiting applications involving African languages.
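The abstract describes a hybrid objective combining standard knowledge distillation on output distributions with a simplified attention matching term. The paper's exact formulation is not given here, so the sketch below is a hypothetical NumPy illustration of one common way such a hybrid loss is assembled: a temperature-softened KL divergence between teacher and student logits, plus a mean-squared error between their attention matrices, mixed by a weight `alpha`. The function and parameter names are assumptions, not the paper's.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_distillation_loss(student_logits, teacher_logits,
                             student_attn, teacher_attn,
                             T=2.0, alpha=0.5):
    """Hypothetical hybrid loss: softened KL on output distributions
    (semantic fidelity) plus MSE on attention matrices (attention
    distribution alignment). `T` and `alpha` are assumed hyperparameters."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student), scaled by T^2 as is conventional in distillation
    kd = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))) * (T ** 2)
    # Attention-matrix consistency term
    attn = np.mean((np.asarray(student_attn) - np.asarray(teacher_attn)) ** 2)
    return alpha * kd + (1 - alpha) * attn
```

When the student exactly reproduces the teacher's logits and attention maps, both terms vanish and the loss is zero; any divergence in either component increases it, which is the dual-constraint behavior the abstract alludes to.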