AI Summary
To address the performance degradation in large language model inference caused by the Key-Value (KV) cache bottleneck of attention, this paper proposes Multi-matrix Factorization Attention (MFA) and its key-reuse variant (MFA-KR). MFA introduces low-rank matrix factorization into the Query-Key (QK) circuit, jointly scaling the number and dimension of attention heads to enhance model capacity without increasing the number of trainable parameters. MFA-KR further re-parameterizes the value projection so that the key cache is reused as the value, achieving extreme KV cache compression. Both methods operate under strict memory constraints while preserving model expressivity. Experiments demonstrate that MFA reduces KV cache usage by 56% and MFA-KR by 93.7%; both match standard multi-head attention (MHA) in accuracy and significantly outperform the state-of-the-art MLA method. Crucially, neither approach adds trainable parameters, ensuring parameter efficiency and straightforward integration into existing transformer architectures.
Abstract
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including SOTA methods such as MLA, fail to maintain equally strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as the value through value projection re-parameterization. MFA's design enables strong model capacity under a tight KV cache budget, while MFA-KR suits even harsher KV cache limits with a minor performance trade-off. Notably, in our extensive, large-scale experiments, the proposed architectures outperform MLA and perform comparably to MHA, while reducing KV cache usage by up to 56% (MFA) and 93.7% (MFA-KR).
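To make the mechanism concrete, here is a minimal sketch of attention with a low-rank factorized QK circuit: queries pass through a shared down-projection followed by per-head up-projections (allowing many heads at small parameter cost), while a single shared key/value head keeps the KV cache small. All names, shapes, and the single-KV-head layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mfa_attention_sketch(x, W_dq, W_uq, W_k, W_v, W_o):
    """Illustrative low-rank-QK attention (shapes are assumptions).

    x:    (seq, d_model) input activations
    W_dq: (d_model, r)          shared low-rank query down-projection
    W_uq: (heads, r, d_head)    per-head query up-projections
    W_k:  (d_model, d_head)     single shared key head  -> small K cache
    W_v:  (d_model, d_head)     single shared value head -> small V cache
    W_o:  (heads * d_head, d_model) output projection
    """
    q_low = x @ W_dq                               # (seq, r)
    q = np.einsum('sr,hre->hse', q_low, W_uq)      # (heads, seq, d_head)
    k = x @ W_k                                    # (seq, d_head): only keys cached
    v = x @ W_v                                    # (seq, d_head): only values cached
    scores = np.einsum('hse,te->hst', q, k) / np.sqrt(k.shape[-1])
    out = np.einsum('hst,te->hse', softmax(scores), v)
    out = out.transpose(1, 0, 2).reshape(x.shape[0], -1)  # concatenate heads
    return out @ W_o
```

In the key-reuse spirit of MFA-KR, one could further tie `W_v` to `W_k` (e.g. `W_v = W_k @ G` for a small learned matrix `G`, a hypothetical re-parameterization in this sketch) so that only the key cache needs to be stored.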