🤖 AI Summary
To address the memory and computational overhead caused by KV cache expansion in large language model (LLM) inference, this paper proposes an end-to-end MHA-to-GQA compression paradigm. Our method introduces an “align-then-merge” strategy: first applying orthogonal transformations across attention heads to enhance inter-head similarity, then incorporating a RoPE-compatible L0-sparse mask to enable differentiable, arbitrary-ratio KV head pruning. This is the first approach supporting differentiable, end-to-end MHA-to-GQA conversion under RoPE position encoding, requiring only supervised fine-tuning for deployment. Evaluated on LLaMA2-7B with 87.5% KV head compression, our method achieves near-lossless performance on major benchmarks—including MMLU and ARC—while maintaining full compatibility with standard GQA inference frameworks.
📝 Abstract
Large language models have been shown to perform well on a variety of natural language processing problems. However, as model size and input sequence length increase, the rapid growth of the KV cache significantly slows down inference. Therefore the GQA model, as an alternative to the MHA model, has been widely introduced into LLMs. In this work, we propose a low-cost method for pruning MHA models into GQA models with an arbitrary compression ratio of key-value heads. Our method is based on $\mathit{L_0}$ masks to gradually remove redundant parameters. In addition, before pruning training we apply orthogonal transformations to attention heads, without changing the model's output, to increase the similarity between attention heads and further improve model performance. Our method is compatible with rotary position embedding (RoPE), which means the trained model can be fully adapted to mainstream standard GQA frameworks. Experiments demonstrate that our strategy can compress up to 87.5% of the key-value heads of the LLaMA2-7B model with only modest performance degradation, achieved through supervised fine-tuning alone.
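Differentiable $L_0$ masks of the kind described above are commonly realized with a hard-concrete relaxation, which lets a per-head gate take exact zeros (pruning the head) while remaining differentiable in expectation. The sketch below is a minimal illustration of such a gate for a single KV head; it is not the paper's implementation, and all names and hyperparameter values (`GAMMA`, `ZETA`, `BETA`, `log_alpha`) are hypothetical:

```python
import math
import random

# Hypothetical hard-concrete gate for one KV head (illustrative sketch only).
# gamma < 0 and zeta > 1 stretch the sigmoid so the clamped gate can reach
# exactly 0 (head pruned) or exactly 1 (head kept).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sample_gate(log_alpha: float, u: float) -> float:
    """Sample a gate z in [0, 1] from the hard-concrete distribution.

    log_alpha is the learnable location parameter; u is uniform noise in (0, 1).
    """
    s = 1.0 / (1.0 + math.exp(-(math.log(u) - math.log(1.0 - u) + log_alpha) / BETA))
    s_bar = s * (ZETA - GAMMA) + GAMMA          # stretch sigmoid output to (gamma, zeta)
    return min(1.0, max(0.0, s_bar))            # clamp: exact zeros prune the head

def expected_l0(log_alpha: float) -> float:
    """Probability the gate is nonzero -- the differentiable L0 penalty term."""
    return 1.0 / (1.0 + math.exp(-(log_alpha - BETA * math.log(-GAMMA / ZETA))))

random.seed(0)
# A strongly negative log_alpha drives the gate toward an exact 0 (head pruned),
# while a strongly positive one drives it toward an exact 1 (head kept).
z_pruned = sample_gate(-6.0, random.random())
z_kept = sample_gate(6.0, random.random())
```

During pruning training, the gates multiply the key-value head outputs and the sum of `expected_l0` terms is added to the loss, so redundant heads are gradually driven to exact zeros and can be removed at the target compression ratio.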