🤖 AI Summary
Existing pruning methods for multilingual large language models (LLMs) commonly rely solely on English calibration data, neglecting linguistic diversity and potentially compromising the retention of cross-lingual capabilities. Method: We conduct the first systematic empirical study—across languages (English, Chinese, French, Spanish), tasks (downstream understanding and language modeling), and models (mT5, BLOOM, Qwen)—evaluating mainstream pruning techniques (Magnitude, SNIP, GRASP) calibrated on the target language versus English only. Results: Target-language calibration effectively preserves language-specific modeling capacity but does not consistently improve downstream task performance. Crucially, language-agnostic reasoning and knowledge representations prove highly fragile during pruning and are disproportionately removed. Our findings uncover a critical inconsistency between calibration language choice and functional capability preservation, challenging the assumption that the choice of calibration language is neutral. This work provides both theoretical insight into representation vulnerability in multilingual LLMs and practical guidance for designing language-aware pruning strategies that enable efficient, equitable multilingual model compression.
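To make the role of calibration data concrete, here is a minimal NumPy sketch (not the paper's implementation) contrasting pure magnitude pruning, which ignores calibration data entirely, with a calibration-aware score in the style of activation-weighted pruning, where per-weight importance is |W_ij| · ||x_j||. The "English" and "target-language" batches below are synthetic stand-ins; the point is only that changing the calibration distribution changes which weights survive.

```python
import numpy as np

def magnitude_mask(W, sparsity):
    """Keep the largest-|w| weights; calibration data plays no role."""
    k = int(W.size * sparsity)
    thresh = np.sort(np.abs(W), axis=None)[k]
    return np.abs(W) >= thresh

def calibrated_mask(W, X, sparsity):
    """Activation-weighted score |W_ij| * ||x_j||_2 over a calibration
    batch X (rows = tokens, cols = input features). A different
    calibration language yields a different X, hence a different mask."""
    score = np.abs(W) * np.linalg.norm(X, axis=0)  # broadcast over rows
    k = int(score.size * sparsity)
    thresh = np.sort(score, axis=None)[k]
    return score >= thresh

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X_en = rng.normal(size=(32, 16))                              # "English" batch
X_tgt = rng.normal(size=(32, 16)) * np.linspace(0.1, 2.0, 16) # shifted feature stats

m_en = calibrated_mask(W, X_en, 0.5)
m_tgt = calibrated_mask(W, X_tgt, 0.5)
print("weights kept:", int(m_en.sum()))
print("mask entries that differ across calibration sets:", int((m_en != m_tgt).sum()))
```

Both masks keep the same number of weights, yet they disagree on which ones, which is exactly why the choice of calibration language is not neutral.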
📝 Abstract
Recent advances in large language model (LLM) pruning have shown state-of-the-art (SotA) compression results in post-training and retraining-free settings while maintaining high predictive performance. However, previous research has mainly considered calibration based on English text, despite the multilingual nature of modern LLMs and their frequent use in non-English languages. In this paper, we set out to investigate how to calibrate the pruning of multilingual language models for monolingual applications. We present the first comprehensive empirical study comparing different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques. Our results offer practical suggestions; for example, calibrating in the target language efficiently retains the language modeling capability but does not necessarily benefit downstream tasks. Through further analysis of latent subspaces, pruning masks, and individual neurons within pruned models, we find that while pruning generally preserves strong language-specific features, it may fail to retain language-specific neuron activation patterns and the subtle, language-agnostic features associated with knowledge and reasoning that complex tasks require.