EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitively high memory overhead of key-value (KV) caches in large language model (LLM) inference—which hinders low-latency deployment—this paper proposes Embedding-Gated Multi-head Latent Attention (EG-MLA). EG-MLA extends the Multi-Head Latent Attention (MLA) framework with a token-specific embedding gating mechanism that performs fine-grained modulation of compressed KV vectors within a shared latent space, implicitly capturing higher-order interactions while preserving representational fidelity. Compared to standard Multi-Head Attention (MHA), EG-MLA reduces KV cache memory by over 91.6%; relative to MLA, it cuts memory usage by up to a further 59.9%. Empirically, EG-MLA improves accuracy across multiple reasoning benchmarks and scales successfully to models exceeding 1 billion parameters. These results enhance the deployability of large-scale LLMs in resource-constrained environments.

📝 Abstract
Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational power, it incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, achieving a better trade-off between performance and cache efficiency. While MLA already achieves significant KV cache reduction, the scope for further compression remains limited without performance loss. In this paper, we propose Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA achieves over 91.6% reduction in KV cache size with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9% additional memory savings. Our theoretical analysis highlights how embedding gating induces implicit high-order interactions, and empirical evaluations demonstrate robust generalization across model scales and compression regimes. Notably, we successfully scale EG-MLA to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.
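The mechanism the abstract describes—compressing KV representations into a shared latent vector and modulating that vector with a token-specific gate looked up from an embedding table—can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the weight names, shapes, and the sigmoid gate are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
d_model, d_latent, vocab = 64, 16, 100

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)   # compress to latent
G = rng.standard_normal((vocab, d_latent))                             # per-token gate embeddings
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)  # decompress to K
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)  # decompress to V

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eg_mla_kv(h, token_ids):
    """h: (seq, d_model) hidden states; token_ids: (seq,) input token ids.

    Returns the cached latent vectors plus the reconstructed K and V.
    Only `c` (d_latent per token) would live in the KV cache, instead of
    full per-head K and V (2 * d_model per token under MHA).
    """
    c = h @ W_down                    # shared compressed latent KV
    c = c * sigmoid(G[token_ids])     # token-specific elementwise gating in latent space
    return c, c @ W_up_k, c @ W_up_v
```

The gate is a lookup plus an elementwise product, so it adds negligible compute and no cached state beyond the latent vectors themselves, which is how expressiveness can rise while the cache shrinks.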
Problem

Research questions and friction points this paper is trying to address.

Reducing KV cache size for efficient LLM inference
Improving memory efficiency without performance degradation
Enhancing representational expressiveness in compressed attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embedding-gated mechanism modulates compressed KV vectors
Token-specific gating in latent space enhances expressiveness
Achieves over 91.6% KV cache reduction versus MHA
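The cache-reduction figures above come from comparing per-token cache footprints. The accounting can be illustrated with back-of-the-envelope arithmetic; the head counts and latent widths below are hypothetical, not the paper's configuration, so the resulting percentages only approximate the reported 91.6% and 59.9%.

```python
# Illustrative KV-cache accounting (all dimensions assumed, not from the paper).
n_heads, d_head = 32, 128        # MHA head layout
d_latent_mla = 1024              # shared latent width under MLA
d_latent_eg = 512                # smaller latent width under EG-MLA

mha_per_token = 2 * n_heads * d_head   # cache K and V for every head
mla_per_token = d_latent_mla           # cache one shared latent vector
eg_per_token = d_latent_eg             # embedding gate adds no cached state

reduction_vs_mha = 1 - eg_per_token / mha_per_token   # 0.9375 with these dims
reduction_vs_mla = 1 - eg_per_token / mla_per_token   # 0.5 with these dims
```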
Zhengge Cai
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University
Haowen Hou
Assistant Professor, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
RWKV · LLM · VLM · Information Retrieval