🤖 AI Summary
To address the prohibitively high memory overhead of key-value (KV) caches in large language model (LLM) inference—which hinders low-latency deployment—this paper proposes Embedding-Gated Multi-head Latent Attention (EG-MLA). EG-MLA extends the Multi-head Latent Attention (MLA) framework by introducing a token-specific embedding gating mechanism that performs fine-grained modulation of compressed KV vectors within a shared latent space, implicitly inducing higher-order interactions while preserving representational fidelity. Compared to standard Multi-Head Attention (MHA), EG-MLA reduces KV cache memory by over 91.6%; relative to MLA, it cuts memory usage by up to a further 59.9%. Empirically, EG-MLA improves accuracy across multiple reasoning benchmarks and scales successfully to models exceeding 1 billion parameters. These results significantly enhance the deployability of large-scale LLMs in resource-constrained environments.
📝 Abstract
Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational power, it incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, achieving a better trade-off between performance and cache efficiency. While MLA already achieves significant KV cache reduction, the scope for further compression remains limited without performance loss. In this paper, we propose **Embedding-Gated Multi-head Latent Attention (EG-MLA)**, a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA achieves over 91.6% reduction in KV cache size with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9% additional memory savings. Our theoretical analysis highlights how embedding gating induces implicit high-order interactions, and empirical evaluations demonstrate robust generalization across model scales and compression regimes. Notably, we successfully scale EG-MLA to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.
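The abstract describes the mechanism only at a high level, so the following is a minimal numpy sketch of the idea under stated assumptions: MLA-style down-projection of each token into a shared latent vector (the quantity actually cached), a token-specific gate looked up from a learned embedding table and applied elementwise in the latent space, and per-head up-projection at attention time. The dimensions, the sigmoid gate, and every variable name here are toy illustrations, not the paper's actual architecture or sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper does not specify these here).
d_model, d_latent, n_heads, d_head, seq, vocab = 64, 16, 4, 16, 8, 100

tokens = rng.integers(0, vocab, size=seq)       # token ids for a toy sequence
x = rng.standard_normal((seq, d_model))         # hidden states

# MLA-style compression: one shared latent vector per token replaces the
# per-head K/V caches. Only c_kv would be stored in the KV cache.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
c_kv = x @ W_down                               # (seq, d_latent)

# EG-MLA's described addition: a token-specific embedding gate applied in
# the latent space. We assume a sigmoid gate from an embedding table.
gate_table = rng.standard_normal((vocab, d_latent))
gate = 1.0 / (1.0 + np.exp(-gate_table[tokens]))  # (seq, d_latent), in (0, 1)
c_gated = c_kv * gate                           # fine-grained elementwise modulation

# Up-project the gated latents to per-head keys (values are analogous).
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
k = (c_gated @ W_uk).reshape(seq, n_heads, d_head)

# Cache-size intuition with these toy numbers: MHA caches 2 * n_heads * d_head
# floats per token (128), the latent scheme caches d_latent (16) -- the paper's
# reported percentages come from its own configurations, not these values.
```

The gate lookup adds only an embedding read and an elementwise multiply per token, which is consistent with the abstract's claim of minimal additional computation.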