Cross-layer Attention Sharing for Large Language Models

πŸ“… 2024-08-04
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 6
✨ Influential: 2
πŸ€– AI Summary
To address the high redundancy in cross-layer attention mechanisms of large language models (LLMs), which severely hampers inference efficiency, this paper proposes LiSA, a lightweight self-attention alternative. Methodologically, LiSA first systematically shows that attention patterns are strongly similar across LLM layers; it then introduces a joint mechanism of head-reordering alignment and low-rank differential modeling to overcome the failure of direct weight sharing and the sensitivity of shallow layers to weight deviations. Specifically, it employs a tiny feed-forward network for inter-layer attention-head alignment, uses low-rank matrix decomposition to capture inter-layer weight discrepancies, and integrates KV-cache compression with head-grouping optimization. Evaluated on 13 benchmarks, LiSA achieves no degradation in perplexity or generation quality, eliminates redundant attention computation in 53–84% of layers, compresses Q/K parameters by 6×, and improves throughput by 19.5% on LLaMA3-8B and 32.3% on LLaMA2-7B.
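The summary's starting observation, that attention patterns in nearby layers are highly similar, can be illustrated with a small sketch. This uses synthetic, correlated attention maps rather than a real model's activations; the shapes and noise level are arbitrary assumptions for illustration only.

```python
import numpy as np

# Illustrative check of the paper's observation: attention maps in
# adjacent layers are often highly similar. We fake two "adjacent-layer"
# maps as softmax over shared scores plus small per-layer noise; in a
# real study the maps would come from a trained LLM's forward pass.
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq = 16
base_scores = rng.normal(size=(seq, seq))                      # shared structure
attn_l  = softmax(base_scores + 0.1 * rng.normal(size=(seq, seq)))
attn_l1 = softmax(base_scores + 0.1 * rng.normal(size=(seq, seq)))

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(attn_l, attn_l1)
print(f"cosine similarity of adjacent-layer attention maps: {sim:.3f}")
```

High similarity of this kind is what makes it tempting to reuse one layer's attention for another, which is the paper's point of departure.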

πŸ“ Abstract
As large language models (LLMs) evolve, increases in model depth and parameter count lead to substantial redundancy. To enhance the efficiency of the attention mechanism, previous works primarily compress the KV cache or group attention heads, while largely overlooking redundancy between layers. Our comprehensive analyses across various LLMs show that highly similar attention patterns persist within most layers. It is intuitive to save computation by sharing attention weights across layers. However, further analysis reveals two challenges: (1) directly sharing the weight matrix without carefully rearranging the attention heads proves ineffective; (2) shallow layers are vulnerable to small deviations in attention weights. Driven by these insights, we introduce LiSA, a lightweight substitute for self-attention in well-trained LLMs. LiSA employs tiny feed-forward networks to align attention heads between adjacent layers and low-rank matrices to approximate differences in layer-wise attention weights. Evaluations on 13 typical benchmarks demonstrate that LiSA maintains high response quality in terms of accuracy and perplexity while reducing redundant attention calculations in 53–84% of the total layers. Our implementations of LiSA achieve a 6× compression of Q and K, with maximum throughput improvements of 19.5% for LLaMA3-8B and 32.3% for LLaMA2-7B.
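The abstract's "low-rank matrices to approximate differences in layer-wise attention weights" can be sketched as storing one shared projection plus a rank-r correction per layer. The sizes below are illustrative assumptions (not the authors' configuration), with the rank chosen so the toy compression ratio happens to land at the paper's reported 6×.

```python
import numpy as np

# Sketch of low-rank differential modeling (our reading of the abstract,
# not the authors' code): W_l ≈ W_shared + U_l @ V_l, so each layer
# stores only a rank-r delta instead of a full d_model x d_model matrix.
rng = np.random.default_rng(0)
W_shared = rng.normal(size=(8, 8))
U = rng.normal(size=(8, 2))
V = rng.normal(size=(2, 8))
W_layer = W_shared + U @ V          # rank-2 per-layer correction (toy sizes)

# Parameter accounting at LLM-like (but assumed) sizes:
d_model, rank, n_shared_layers = 4096, 256, 24
full_params = n_shared_layers * d_model * d_model
lisa_params = d_model * d_model + n_shared_layers * 2 * d_model * rank
compression = full_params / lisa_params
print(f"approx. Q/K parameter compression: {compression:.1f}x")
```

The point of the arithmetic is that the shared matrix is paid for once while each layer's delta costs only `2 * d_model * rank` parameters, which is how sharing plus low-rank corrections can compress Q/K projections severalfold.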
Problem

Research questions and friction points this paper is trying to address.

Reducing redundancy in attention mechanisms across layers
Addressing challenges in sharing attention weights between layers
Maintaining response quality while compressing attention calculations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shares attention weights across model layers
Uses tiny networks to align adjacent attention heads
Employs low-rank matrices for weight differences
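The "tiny networks to align adjacent attention heads" idea above can be sketched as a small per-head feed-forward remapping. Everything here is assumed for illustration: the shapes, the two-layer ReLU form, and the residual connection are our guesses at a plausible shape, not the authors' implementation.

```python
import numpy as np

# Minimal sketch (assumed shapes, not the authors' code) of using a tiny
# feed-forward network to align attention heads between adjacent layers:
# head states from layer l are remapped so they can stand in for layer
# l+1's heads.
rng = np.random.default_rng(1)
n_heads, head_dim, hidden = 8, 64, 32      # "tiny" FFN: small hidden size

# Per-head states from layer l (e.g., query vectors), shape (heads, dim)
q_layer_l = rng.normal(size=(n_heads, head_dim))

# Tiny two-layer FFN applied to each head's state
W1 = rng.normal(scale=0.1, size=(head_dim, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, head_dim))

def align_heads(q):
    h = np.maximum(q @ W1, 0.0)            # ReLU hidden layer
    return q + h @ W2                      # residual: learn a small correction

q_aligned = align_heads(q_layer_l)
print("aligned head states:", q_aligned.shape)
```

Because `hidden` is much smaller than `head_dim * n_heads`, such an alignment network adds far fewer parameters than the attention projections it lets the model avoid recomputing.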
Yongyu Mu
Northeastern University
multilingualism, machine translation, efficient models
Yuzhang Wu
NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yuchun Fan
Northeastern University
Chenglong Wang
NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Hengyu Li
NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Qiaozhi He
ByteDance
LLM, Natural Language Processing
Murun Yang
NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Tong Xiao
NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China
Jingbo Zhu
Northeastern University, China
Machine Translation, Language Parsing, Natural Language Processing