GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching

πŸ“… 2025-06-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Efficiently compressing multiple fine-tuned large language model (LLM) variants is challenging because diverse task-specific capabilities must be preserved while the parameter count is reduced. Method: This paper proposes a structured pruning framework based on layer selection and stitching. Unlike conventional single-model pruning, it selects, prunes, and fuses critical layers across multiple fine-tuned variants, casting the choice of per-layer operations as a zero-order combinatorial optimization problem that jointly trades off parameter count against capability retention. Technically, it combines structured pruning, inter-layer merging, and multi-candidate layer selection. Contribution/Results: On the Llama2-13B family, the method removes about 25% of parameters while retaining an average of 97.3% of task performance, outperforming prior state-of-the-art approaches. To the authors' knowledge, this is the first work to formulate multi-variant collaborative pruning as a zero-order combinatorial optimization problem, establishing a new paradigm for lightweight LLM deployment.

πŸ“ Abstract
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single-model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) layer removal, (2) layer selection from different candidate models, and (3) layer merging. Our experiments demonstrate that this approach leads to competitive model pruning; for example, for the Llama2-13B model family, our compressed models maintain approximately 97.3% of the original performance while removing ~25% of parameters, significantly outperforming previous state-of-the-art methods. The code is available at https://github.com/Guinan-Su/auto-merge-llm.
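The search space described in the abstract can be sketched as follows: each layer position is assigned one of the three operations (removal, selection from a candidate variant, or merging), and a gradient-free search samples such plans under a parameter budget. This is a minimal illustrative sketch, not the authors' implementation; the operation names, the averaging merge, the random-search loop, and the `score_fn` proxy are all assumptions, and real LLM layers would be weight tensors rather than scalars.

```python
import random

# Illustrative operations over each layer position (assumed names, not the
# paper's API). Layers are represented as scalars for simplicity.
REMOVE, SELECT_A, SELECT_B, MERGE = "remove", "select_a", "select_b", "merge"
OPS = [REMOVE, SELECT_A, SELECT_B, MERGE]

def stitch(model_a, model_b, plan):
    """Build a pruned model from two finetuned variants given a per-layer plan."""
    layers = []
    for la, lb, op in zip(model_a, model_b, plan):
        if op == REMOVE:
            continue                          # (1) layer removal
        elif op == SELECT_A:
            layers.append(la)                 # (2) select layer from variant A
        elif op == SELECT_B:
            layers.append(lb)                 # (2) select layer from variant B
        else:
            layers.append(0.5 * (la + lb))    # (3) layer merging (simple average)
    return layers

def random_search(model_a, model_b, score_fn, budget=200, keep_ratio=0.75, seed=0):
    """Zero-order (gradient-free) search: sample plans, keep the best-scoring
    one that satisfies the parameter budget. A stand-in for the paper's
    combinatorial optimizer."""
    rng = random.Random(seed)
    n = len(model_a)
    best_plan, best_score = None, float("-inf")
    for _ in range(budget):
        plan = [rng.choice(OPS) for _ in range(n)]
        kept = sum(op != REMOVE for op in plan)
        if kept > keep_ratio * n:             # enforce the compression target
            continue
        score = score_fn(stitch(model_a, model_b, plan))
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score
```

In practice `score_fn` would be a downstream-task evaluation of the stitched model, which is what makes the objective black-box and motivates zero-order optimization over this discrete space.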
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM size for easier deployment and inference
Pruning models by merging layers from finetuned variants
Optimizing layer removal, selection, and merging for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Strategic layer cutting and stitching for model pruning
Zero-order optimization for optimal layer selection
Combining finetuned model variants to preserve capabilities
πŸ”Ž Similar Papers
No similar papers found.