IteRABRe: Iterative Recovery-Aided Block Reduction

📅 2025-03-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Problem: High deployment costs of large language models (LLMs), and the difficulty existing block-pruning methods have in simultaneously achieving high compression ratios and preserving model performance. Method: We propose a lightweight iterative block-pruning framework centered on a novel "recovery-guided iterative pruning" paradigm, integrating gradient-sensitive recovery initialization, multi-round structured fine-tuning, and task-aware evaluation. Contribution/Results: With only 2.5M tokens of recovery overhead, our method achieves an average 3% performance gain over baselines on Llama3.1-8B and Qwen2.5-7B, with up to 5% improvement on language understanding tasks and significantly enhanced multilingual robustness. Moreover, we empirically characterize architectural differences in block-level pruning behavior across LLMs, establishing a new paradigm and empirical foundation for efficient LLM compression.
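The recovery-guided iterative pruning loop described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the importance metric and the recovery step are hypothetical stand-ins (the real method uses its own scoring and a short fine-tune on ~2.5M tokens), and the "model" here is just a list of blocks with placeholder scores.

```python
# Hedged sketch of recovery-guided iterative block pruning:
# per round, drop the least important block, then run a brief
# recovery step before scoring again. Scoring and recovery are
# toy placeholders, not the paper's actual criteria.

def block_importance(blocks):
    """Toy per-block importance (stand-in for a real metric)."""
    return {i: b["score"] for i, b in enumerate(blocks)}

def recover(blocks):
    """Placeholder for a short recovery fine-tune on a small
    token budget; here it just nudges scores upward."""
    for b in blocks:
        b["score"] *= 1.05

def iterative_prune(blocks, target_num_blocks):
    """Remove one block per round, recovering between rounds."""
    blocks = [dict(b) for b in blocks]  # keep caller's list intact
    while len(blocks) > target_num_blocks:
        scores = block_importance(blocks)
        worst = min(scores, key=scores.get)  # least important block
        del blocks[worst]                    # structured removal
        recover(blocks)                      # short recovery phase
    return blocks

model = [{"name": f"block{i}", "score": s}
         for i, s in enumerate([0.9, 0.2, 0.7, 0.4, 0.8])]
pruned = iterative_prune(model, 3)
print([b["name"] for b in pruned])  # → ['block0', 'block2', 'block4']
```

The key design point the loop illustrates is the interleaving: recovery runs after every removal, so each subsequent pruning decision is made on a partially restored model rather than on accumulated damage.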

๐Ÿ“ Abstract
Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in preserving linguistic capabilities, showing a 5% improvement over the baselines on language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating preservation of multilingual capabilities.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM deployment costs through effective model compression.
Maintaining performance under pruning with only minimal computational resources for recovery.
Preserving linguistic and multilingual capabilities after compression.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recovery-guided iterative block-pruning method for model compression
Minimal recovery overhead (only 2.5M tokens)
Preserves linguistic and multilingual capabilities