🤖 AI Summary
Existing multilingual large language model (LLM) compression methods rely on English-centric calibration sets, leading to substantial performance degradation, particularly for low-resource languages. Method: This paper proposes a multilingual calibration data sampling strategy grounded in the language distribution of the original training corpus, so that the calibration set mirrors the proportional representation of languages in the pretraining data, which it presents as the first such approach. It further integrates model pruning, quantization, and BLOOM-specific architectural adaptations, validated via cross-lingual performance attribution analysis. Contribution/Results: Experiments on BLOOM demonstrate that the method significantly narrows the cross-lingual performance gap: BLEU scores for multiple low-resource languages improve by over 15%. The work uncovers an interaction between linguistic similarity and training-data language proportion in preserving multilingual performance after compression. Overall, it establishes a practical, non-English-centric paradigm for efficient multilingual LLM compression.
📝 Abstract
Large Language Models (LLMs) have ushered in a new era in Natural Language Processing, but their massive size demands effective compression techniques for practicality. Although numerous model compression techniques have been investigated, they typically rely on a calibration set that overlooks the multilingual context, which results in significant accuracy degradation for low-resource languages. This paper introduces Multilingual Brain Surgeon (MBS), a novel calibration data sampling method for multilingual LLM compression. MBS overcomes the English-centric limitations of existing methods by sampling calibration data from various languages in proportion to the language distribution of the model's training datasets. Our experiments, conducted on the BLOOM multilingual LLM, demonstrate that MBS improves the performance of existing English-centric compression methods, especially for low-resource languages. We also uncover the dynamics of language interaction during compression: the larger the proportion of a language in the training set, and the more similar the language is to the calibration language, the better the performance that language retains after compression. In conclusion, MBS presents an innovative approach to compressing multilingual LLMs, addressing the performance disparities and improving the language inclusivity of existing compression techniques.
Keywords: Large Language Model, Multilingual Model Compression
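The proportional sampling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `allocate_samples` and `sample_calibration_set`, the largest-remainder rounding, the fallback to sampling with replacement, and the language weights in the example are all assumptions made for illustration, and the weights are not BLOOM's actual training distribution.

```python
import random
from typing import Dict, List


def allocate_samples(proportions: Dict[str, float], total: int) -> Dict[str, int]:
    """Split `total` calibration samples across languages in proportion to
    their share of the training corpus.

    Largest-remainder rounding is an assumption of this sketch; it makes the
    per-language counts sum exactly to `total`.
    """
    norm = sum(proportions.values())
    if norm <= 0:
        raise ValueError("language proportions must sum to a positive value")
    quotas = {lang: total * share / norm for lang, share in proportions.items()}
    counts = {lang: int(quota) for lang, quota in quotas.items()}
    leftover = total - sum(counts.values())
    # Give the remaining samples to the languages with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda lang: quotas[lang] - counts[lang], reverse=True)
    for lang in by_remainder[:leftover]:
        counts[lang] += 1
    return counts


def sample_calibration_set(
    corpora: Dict[str, List[str]],
    proportions: Dict[str, float],
    total: int,
    seed: int = 0,
) -> List[str]:
    """Draw a mixed-language calibration set from per-language text pools."""
    rng = random.Random(seed)
    counts = allocate_samples(proportions, total)
    calibration: List[str] = []
    for lang, n in counts.items():
        pool = corpora[lang]
        if n <= len(pool):
            calibration.extend(rng.sample(pool, n))     # without replacement
        else:
            calibration.extend(rng.choices(pool, k=n))  # pool too small: with replacement
    rng.shuffle(calibration)
    return calibration


if __name__ == "__main__":
    # Hypothetical token counts per language, not BLOOM's real distribution.
    weights = {"en": 50, "fr": 30, "sw": 20}
    print(allocate_samples(weights, 10))  # {'en': 5, 'fr': 3, 'sw': 2}
```

An English-centric baseline corresponds to putting all the weight on `"en"`. The sketch only changes how the calibration texts are chosen; the resulting set would then be passed to whichever pruning or quantization method is being calibrated.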