Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging

📅 2024-06-24
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 2 (Influential: 1)
🤖 AI Summary
Deploying large language models (LLMs) in resource-constrained environments remains challenging due to their computational and memory demands, while existing pruning methods often discard critical knowledge embedded in removed parameters. To address this, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a layer-merging compression method grounded in manifold learning and the information bottleneck principle. MKA quantifies inter-layer semantic similarity via manifold alignment, explicitly identifies functionally redundant layers, and merges them with knowledge-preserving aggregation, enabling knowledge transfer rather than parameter elimination. Notably, MKA is the first to incorporate manifold alignment into hierarchical knowledge distillation-based layer merging, departing fundamentally from conventional pruning paradigms. Evaluated on Llama3-8B, MKA achieves a 43.75% model size reduction with only a 2.82% drop in MMLU accuracy, substantially outperforming pruning baselines. Moreover, MKA synergizes effectively with quantization, further enhancing compression efficiency.
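
To make the similarity computation concrete, here is a minimal sketch, assuming activations are collected on a small calibration set: each layer's hidden states are embedded with a Gaussian diffusion kernel, and two layers are compared by correlating their kernel matrices. The function names (`diffusion_kernel`, `layer_similarity`) and the correlation-based comparison are illustrative assumptions; the paper's actual measure couples manifold alignment with an Information Bottleneck term.

```python
import numpy as np

def diffusion_kernel(X, sigma=1.0):
    """Gaussian (diffusion) kernel over pairwise distances of one layer's
    activations. X: (n_samples, hidden_dim) hidden states on a calibration set."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def layer_similarity(H_i, H_j, sigma=1.0):
    """Proxy for manifold alignment between layers i and j: correlation of
    their flattened diffusion-kernel affinity matrices (assumption)."""
    K_i = diffusion_kernel(H_i, sigma).ravel()
    K_j = diffusion_kernel(H_j, sigma).ravel()
    return float(np.corrcoef(K_i, K_j)[0, 1])

# Toy usage: a nearly redundant pair of layers scores close to 1.0.
rng = np.random.default_rng(0)
H_i = rng.normal(size=(64, 32))               # layer i activations
H_j = H_i + 0.05 * rng.normal(size=(64, 32))  # layer j, almost identical
print(layer_similarity(H_i, H_j))             # high score -> merge candidates
```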

📝 Abstract
While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that uses manifold learning and the Information Bottleneck (IB) measure to merge similar layers, reducing model size while preserving essential performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a compression ratio of 43.75% with a minimal performance decrease of only 2.82%. The proposed MKA method offers a resource-efficient and performance-preserving model compression technique for LLMs. We make our code available at https://github.com/SempraETY/Pruning-via-Merging.
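
As a rough sketch of the merging step itself, assuming a simple similarity-weighted average of parameters (the paper's knowledge-preserving aggregation rule may differ), two layers' state dicts can be fused into one, reducing depth by one layer per merge:

```python
import torch

def merge_layer_pair(state_i, state_j, alpha=0.5):
    """Fuse two layers' parameters into one layer.

    state_i, state_j: state_dicts with identical keys and shapes.
    alpha: aggregation weight, e.g. derived from the layers' alignment
    score (assumption; not necessarily the paper's exact rule).
    """
    return {k: alpha * state_i[k] + (1.0 - alpha) * state_j[k]
            for k in state_i}

# Toy usage with two linear "layers".
layer_i = torch.nn.Linear(16, 16)
layer_j = torch.nn.Linear(16, 16)
merged = merge_layer_pair(layer_i.state_dict(), layer_j.state_dict())
layer_i.load_state_dict(merged)  # layer_i now stands in for both layers
```

In a full pipeline, one would repeatedly merge the most similar remaining pair of layers until the target compression ratio is reached.
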
Problem

Research questions and friction points this paper is trying to address.

Compress LLMs effectively for resource-limited environments
Reuse the knowledge in pruned parameters via manifold alignment
Achieve high compression ratios with minimal performance loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manifold alignment identifies and merges similar layers
Normalized Pairwise Information Bottleneck (NPIB) measure guides merging to preserve essential performance
Combines with quantization for greater compression (a sketch follows this list)
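
Because the merged model remains a standard dense network, it composes directly with weight quantization. Below is a generic symmetric per-tensor int8 scheme, included as an assumption for illustration; it is not necessarily the quantizer used in the paper's experiments.

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a (merged) weight tensor."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Approximate float weights recovered for inference."""
    return q.to(torch.float32) * scale

# Toy usage: ~4x memory reduction on top of layer merging (fp32 -> int8).
w = torch.randn(16, 16)
q, s = quantize_int8(w)
print((dequantize_int8(q, s) - w).abs().max())  # small quantization error
```
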
👥 Authors

Deyuan Liu (Harbin Institute of Technology)
Zhanyue Qin (Harbin Institute of Technology)
Hairu Wang (University of Science and Technology of China)
Zhao Yang (Institute of Automation, Chinese Academy of Sciences)
Zecheng Wang (Harbin Institute of Technology)
Fangying Rong (Shandong Agricultural University)
Qingbin Liu (Tencent Inc)
Yanchao Hao (Tencent Inc)
Xi Chen (Tencent Inc)
Cunhang Fan (Anhui University)
Zhao Lv (Anhui University)
Zhiying Tu (Harbin Institute of Technology)
Dianhui Chu (Harbin Institute of Technology)
Dianbo Sui (Harbin Institute of Technology)