🤖 AI Summary
To address the sharp cross-lingual performance degradation in one-shot pruning of multilingual large language models—caused by neglecting linguistic disparities—this paper proposes a language-aware dynamic pruning method. The core innovation lies in (1) explicitly modeling multilingual capability preservation as an optimization objective, (2) characterizing cross-lingual differences via language-grouped activation statistics to dynamically allocate layer-wise sparsity, and (3) adapting the Wanda pruning criterion for multilingual settings. Evaluated on multiple multilingual benchmarks, the method significantly mitigates performance collapse at moderate sparsity levels (30%–50%), yielding average gains of 2.1–4.7 BLEU/accuracy points with negligible computational overhead. Results demonstrate that language-aware sparsity scheduling is critical for balancing multilingual performance retention and model compression efficiency.
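The three ingredients above can be sketched together: Wanda scores each weight by the product of its magnitude and the L2 norm of its input feature's calibration activations; a language-aware variant would gather those norms per language group before aggregating. The aggregation rule (a plain mean over languages) and the function names below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def language_aware_wanda_scores(W, acts_by_lang):
    """Wanda-style importance scores with per-language calibration.

    W: (out_features, in_features) weight matrix.
    acts_by_lang: dict mapping language -> (n_samples, in_features)
        calibration activations for that language.

    NOTE: averaging per-language norms is an illustrative assumption;
    the paper's actual aggregation may differ.
    """
    # Per-language L2 norm of each input feature's activations.
    norms = np.stack([
        np.linalg.norm(X, axis=0) for X in acts_by_lang.values()
    ])  # shape: (n_langs, in_features)
    # Aggregate across languages so no single language dominates.
    agg = norms.mean(axis=0)  # shape: (in_features,)
    # Wanda importance: |weight| * activation norm of its input feature.
    return np.abs(W) * agg[None, :]

def prune_rowwise(W, scores, sparsity):
    """Zero the lowest-scoring weights within each output row."""
    k = int(W.shape[1] * sparsity)
    pruned = W.copy()
    if k > 0:
        idx = np.argsort(scores, axis=1)[:, :k]  # k lowest per row
        np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned
```

Row-wise (per-output) pruning follows the original Wanda recipe; the language-aware part only changes how the activation statistic is computed.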
📝 Abstract
Multilingual LLM performance is often critically dependent on model size. Combined with efficiency concerns, this has led to a surge of interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that moderate sparsity ratios already substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional cost. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.
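The dynamic layerwise sparsity mentioned above can be illustrated with a simple allocation rule: keep the global sparsity budget fixed while letting layers with higher cross-lingual importance retain more weights. The linear reallocation scheme, the deviation cap, and the use of cross-lingual variance as the importance signal are all assumptions for the sake of this sketch, not the paper's exact method.

```python
import numpy as np

def allocate_layer_sparsity(importance, target, max_dev=0.1):
    """Distribute a global sparsity target across layers.

    importance: (n_layers,) nonnegative scores, e.g. the variance of
        activation norms across language groups (an assumed proxy).
    target: desired average sparsity, e.g. 0.5.
    max_dev: cap on per-layer deviation from the target.

    Returns per-layer sparsity ratios whose mean equals `target`
    (barring clipping at the [0, 1] boundaries).
    """
    imp = np.asarray(importance, dtype=float)
    # Center importance so reallocations cancel out on average.
    centered = imp - imp.mean()
    scale = np.abs(centered).max()
    offsets = np.zeros_like(imp) if scale == 0 else max_dev * centered / scale
    # More important layer -> lower sparsity (more weights kept).
    return np.clip(target - offsets, 0.0, 1.0)
```

Each layer would then be pruned independently at its assigned ratio, so the overall compression matches the global target while important layers stay denser.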