UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory overhead and inference latency of deploying post-trained large language models (LLMs), this paper identifies a previously overlooked inter-layer redundancy in Softmax computations. The authors propose a Softmax unification mechanism that shares Softmax activations across Transformer attention layers, augmented by a lightweight linear error-compensation module to preserve accuracy. The method is also compatible with post-training quantization. Experiments show that, while matching the original model's accuracy, the approach reduces KV cache memory by up to 42% and end-to-end inference latency by 31%, significantly outperforming existing efficient architectures (e.g., KV sharing). The core contributions are the systematic discovery of cross-layer Softmax redundancy and the design of a lossless compensation strategy.

📝 Abstract
Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the Softmax operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax **Uni**fication in **Att**e**n**tion (**UniAttn**), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at https://github.com/Bostoncake/UniAttn.
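The abstract's core idea can be sketched as follows: a later attention block reuses the Softmax map already computed by an earlier block, skipping its own QK^T and Softmax, and a trainable linear projection compensates the resulting error. This is a minimal illustrative sketch, not the paper's implementation; the names `unified_attention` and `W_comp`, the identity initialization, and the single-head setup are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention; returns output and its Softmax map."""
    d = q.shape[-1]
    probs = softmax(q @ k.T / np.sqrt(d))
    return probs @ v, probs

def unified_attention(v, shared_probs, W_comp):
    """Hypothetical 'unified' block: reuses a Softmax map from an earlier block,
    so it never computes Q@K^T or Softmax itself. The linear projection W_comp
    (trained during post-training in the paper's setting) compensates the error."""
    return (shared_probs @ v) @ W_comp

rng = np.random.default_rng(0)
T, d = 4, 8  # sequence length, head dimension (toy sizes)
q, k, v1, v2 = (rng.standard_normal((T, d)) for _ in range(4))

out1, probs1 = attention(q, k, v1)            # earlier block computes the Softmax map
W_comp = np.eye(d)                            # identity init; learned in practice
out2 = unified_attention(v2, probs1, W_comp)  # later block reuses probs1
```

Because the unified block also has no use for its keys, its K (and, depending on the variant, V) cache entries can be dropped, which is where the KV-cache savings come from.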
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Resource Efficiency
Computational Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Softmax Unified Attention
Efficient Model Architecture
Sharing Across Transformer Blocks