🤖 AI Summary
Existing depth pruning methods rely on a single similarity metric, which leads to unstable performance across diverse architectures. This work proposes SimDiff, a layer importance criterion that evaluates each layer from two orthogonal perspectives, representational similarity and transformation difference, to assess layer importance more robustly. Specifically, SimDiff measures similarity with cosine similarity and measures difference with two metrics: Mean Squared Successive Difference (MSSD), which is sensitive to outliers, and Mean Absolute Successive Difference (MASD), which robustly captures a layer's average contribution. Experiments on models ranging from 0.5B to 13B parameters show that SimDiff significantly outperforms state-of-the-art baselines: for instance, LLaMA2-7B retains over 91% of its original performance at a 25% pruning ratio, and LLaMA3.1-8B achieves up to a 1.49× inference speedup when 12 layers are pruned.
📝 Abstract
Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine similarity. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The transformation difference is quantified using two distinct metrics: Mean Squared Successive Difference (MSSD), which is sensitive to outliers and identifies layers that make decisive corrections, and Mean Absolute Successive Difference (MASD), which robustly measures a layer's average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49× inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
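To make the three quantities concrete, the following is a minimal sketch, not the authors' implementation. It assumes each layer is scored by comparing the hidden-state vector entering it with the one leaving it, and that MSSD and MASD are the mean squared and mean absolute element-wise differences between those two vectors. The function names and the toy vectors are hypothetical, and the abstract does not say how SimDiff combines the three scores, so no combination rule is shown.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def mssd(x, y):
    """Mean squared difference (assumed form): large changes dominate."""
    return sum((b - a) ** 2 for a, b in zip(x, y)) / len(x)


def masd(x, y):
    """Mean absolute difference (assumed form): every change counts linearly."""
    return sum(abs(b - a) for a, b in zip(x, y)) / len(x)


def layer_scores(hidden_states):
    """Score each layer from consecutive hidden states.

    hidden_states[i] is the vector entering layer i and
    hidden_states[i + 1] is the vector leaving it (an assumption
    made for this illustration).
    """
    scores = []
    for i in range(len(hidden_states) - 1):
        h_in, h_out = hidden_states[i], hidden_states[i + 1]
        scores.append({
            "layer": i,
            "cosine": cosine_similarity(h_in, h_out),
            "mssd": mssd(h_in, h_out),
            "masd": masd(h_in, h_out),
        })
    return scores


# Two updates with the same average magnitude: one concentrated in a
# single coordinate, one spread evenly across all coordinates.
base = [0.0, 0.0, 0.0, 0.0]
spiky = [4.0, 0.0, 0.0, 0.0]
even = [1.0, 1.0, 1.0, 1.0]

print(masd(base, spiky), masd(base, even))  # 1.0 1.0
print(mssd(base, spiky), mssd(base, even))  # 4.0 1.0
```

The toy comparison illustrates why the two difference metrics are complementary: both updates have the same MASD, but the concentrated one has a four times larger MSSD, which matches the description of MSSD as the outlier-sensitive measure and MASD as the robust average.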