🤖 AI Summary
Existing pruning methods for large language models (LLMs) typically treat layer redundancy as an intrinsic property of pre-trained models, overlooking how the evaluation objective shapes the redundancy assessment. This work proposes a functional perspective, arguing that redundancy is jointly determined by the model and the calibration objective. To investigate this, the authors run an empirical study covering three LLM families, two calibration objectives, and seven search algorithms. They find that different calibration objectives identify qualitatively different redundant layers, and that layer rankings based on perplexity do not consistently align with rankings based on downstream task accuracy. Under a fixed objective, by contrast, the search algorithms tend to produce similar solutions. The results suggest that the choice of calibration objective may influence pruning outcomes more than the choice of search algorithm.
📝 Abstract
Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work has focused on importance criteria and search algorithms, often treating layer redundancy as an inherent structural property of pretrained networks. In contrast, we adopt a functional perspective, where redundancy is jointly influenced by the model and the evaluation objective, suggesting that a single universal layer ranking may not be sufficient. Through an empirical study across three LLM families, two calibration objectives, and seven search algorithms, we observe that different objectives identify qualitatively different sets of redundant layers, and that perplexity and downstream accuracy rankings do not consistently align. Under a fixed objective, however, search algorithms tend to produce similar solutions. Overall, our results suggest that the calibration objective may play a more influential role than the choice of search algorithm, indicating that further attention to objective design could be beneficial.
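To make the idea of rankings that "do not consistently align" concrete, the following is a minimal sketch of one way to compare two per-layer importance rankings. It is not the authors' method. The helper names (`ranks`, `spearman`, `most_redundant`) and the per-layer scores are hypothetical, chosen only for illustration, and Spearman rank correlation plus top-k overlap are assumed here as agreement measures; the abstract does not say which measures the paper uses.

```python
from typing import Sequence


def ranks(scores: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of positions tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def most_redundant(scores: Sequence[float], k: int) -> set[int]:
    """Indices of the k layers with the lowest importance score."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i])[:k])


if __name__ == "__main__":
    # Hypothetical importance scores for an 8-layer model (made-up numbers).
    # Higher score = removing the layer hurts the objective more.
    ppl_importance = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
    acc_importance = [0.8, 0.7, 0.6, 0.1, 0.9, 0.2, 0.3, 0.5]

    rho = spearman(ppl_importance, acc_importance)
    shared = most_redundant(ppl_importance, 3) & most_redundant(acc_importance, 3)
    print(f"Spearman rank correlation: {rho:.3f}")
    print(f"Layers both objectives would prune first: {sorted(shared)}")
```

A correlation near 1 would mean the two objectives agree on which layers matter, while a value near 0 or below would mean that a layer ranking calibrated on one objective says little about redundancy under the other.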