🤖 AI Summary
To address the high inference overhead, latency, and memory consumption of post-trained large language models (LLMs), this paper identifies and exploits previously overlooked inter-layer redundancy in Softmax computations during post-training. It proposes a Softmax unification mechanism: sharing Softmax computations across Transformer attention layers, augmented by a lightweight linear error-compensation module that preserves accuracy without architectural or training-pipeline modifications. The method is also compatible with post-training quantization. Experiments show that, while matching the original model's accuracy, the approach reduces KV cache memory by up to 42% and end-to-end inference latency by 31%, significantly outperforming existing efficient architectures (e.g., KV sharing). The core contributions are the systematic discovery of cross-layer Softmax redundancy and the design of a near-lossless compensation strategy.
📝 Abstract
Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the `Softmax` operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax **Uni**fication in **Att**e**n**tion (**UniAttn**), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at https://github.com/Bostoncake/UniAttn.
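The core idea described above (compute the Softmax attention map once and reuse it in later layers, then correct the resulting error with a learned linear projection) can be sketched in a toy numpy example. This is a minimal illustration under our own assumptions, not the paper's actual implementation: the function name `attention_shared_softmax`, the single shared attention map, and the per-layer compensation matrices `W_comp` are all hypothetical simplifications (real UniAttn operates on multi-head attention inside transformer blocks and learns the compensation during post-training).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_shared_softmax(Q_list, K_list, V_list, W_comp_list):
    """Toy sketch of Softmax unification: the attention map A is
    computed ONCE from the first layer's Q and K, then reused by every
    subsequent layer. A per-layer linear projection W compensates for
    the error introduced by reusing A instead of that layer's own map.
    """
    d = Q_list[0].shape[-1]
    A = softmax(Q_list[0] @ K_list[0].T / np.sqrt(d))  # shared Softmax
    outputs = []
    for V, W in zip(V_list, W_comp_list):
        outputs.append((A @ V) @ W)  # reuse A; apply compensation
    return outputs
```

Because `A` is computed from the first layer only, later layers skip both the `QK^T` matmul and the Softmax; in a real deployment this is also what allows shrinking the KV cache, since keys of the sharing layers are no longer needed at inference time.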