AI Summary
Problem: Large language models (LLMs) are costly to deploy, and existing block-pruning methods struggle to simultaneously achieve high compression ratios and preserve model performance. Method: We propose a lightweight iterative block-pruning framework centered on a novel "recovery-guided iterative pruning" paradigm, integrating gradient-sensitive recovery initialization, multi-round structured fine-tuning, and task-aware evaluation. Contribution/Results: With only 2.5M tokens of recovery overhead, our method achieves an average 3% performance gain over baselines on Llama3.1-8B and Qwen2.5-7B, with up to 5% improvement on language-understanding tasks and significantly enhanced multilingual robustness. Moreover, we empirically characterize architectural differences in block-level pruning behavior across LLMs, establishing a new paradigm and empirical foundation for efficient LLM compression.
Abstract
Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in preserving linguistic capabilities, showing a 5% improvement over the baselines on language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating preservation of multilingual capabilities.
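The core idea of iterative block pruning with recovery can be illustrated with a toy sketch: repeatedly score each block by how little its removal changes the model's output, drop the least important block, then run a brief recovery step before the next round. Everything below is a hypothetical, simplified illustration (blocks are modeled as scalar multipliers and the recovery step is a no-op stand-in for fine-tuning), not the paper's actual implementation.

```python
# Toy sketch of an iterative prune-then-recover loop (illustrative only).
# Blocks are modeled as scalar multipliers instead of transformer layers.

def run(blocks, x):
    """Forward pass through the stack of 'blocks'."""
    for b in blocks:
        x = b * x
    return x

def similarity(a, b):
    """Higher value means the two outputs are more similar."""
    return -abs(a - b)

def iterative_prune(blocks, x, target_n, recover=lambda bs: bs):
    """Iteratively remove the least important block until target_n remain."""
    blocks = list(blocks)
    while len(blocks) > target_n:
        # Score each block: how similar is the output once it is ablated?
        scores = [
            similarity(run(blocks, x), run(blocks[:i] + blocks[i + 1:], x))
            for i in range(len(blocks))
        ]
        # Drop the block whose removal perturbs the output the least.
        drop = max(range(len(blocks)), key=scores.__getitem__)
        del blocks[drop]
        # Brief recovery step between rounds (stand-in for the paper's
        # small-budget fine-tuning, e.g. on ~2.5M tokens).
        blocks = recover(blocks)
    return blocks
```

In a real LLM setting, `run` would be the model forward pass on a calibration set, `similarity` a hidden-state or output-distribution metric, and `recover` a short fine-tuning pass after each removal, which is what distinguishes iterative recovery-guided pruning from one-shot block pruning.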