Latent-Condensed Transformer for Efficient Long Context Modeling

📅 2026-04-14

🤖 AI Summary

This work addresses the computational and memory bottlenecks that large language models face on long contexts: the KV cache grows linearly with context length, and self-attention cost grows quadratically. The authors propose Latent-Condensed Attention (LCA), which, for the first time, enables native sparse compression within the low-dimensional latent space of Multi-head Latent Attention (MLA). LCA aggregates semantic latent vectors through query-aware pooling and preserves positional keys through anchor selection, reducing computational cost and KV cache size together. The design is architecture-agnostic and comes with a proven error bound that does not depend on context length. Experiments show that, at a 128K context length, LCA achieves up to 2.5× faster prefilling and a 90% reduction in KV cache size while maintaining competitive model performance.
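To make the two scaling claims concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from the paper. The layer count, per-token cache width, and 2-byte value size are hypothetical placeholders chosen only for illustration; the 128K context length and the 90% reduction are the figures quoted above.

```python
def kv_cache_bytes(n_tokens, n_layers, dims_per_token, bytes_per_value=2):
    """Cache size is linear in the number of cached tokens."""
    return n_tokens * n_layers * dims_per_token * bytes_per_value


def causal_score_count(n_tokens):
    """Causal self-attention scores per head per layer: token i attends to
    tokens 1..i, so the total is n(n+1)/2, which grows quadratically."""
    return n_tokens * (n_tokens + 1) // 2


# Hypothetical model shape (NOT from the paper), for illustration only.
N_LAYERS = 32
DIMS_PER_TOKEN = 512

context = 128 * 1024  # 128K tokens, the context length quoted in the summary
full = kv_cache_bytes(context, N_LAYERS, DIMS_PER_TOKEN)
condensed = full * 10 // 100  # a 90% reduction keeps 10% of the cache

print(f"full cache:      {full / 1024**3:.2f} GiB")
print(f"condensed cache: {condensed / 1024**3:.2f} GiB")
print(f"attention scores at 128K: {causal_score_count(context):,}")
```

Under these assumed dimensions, doubling the context doubles the cache but roughly quadruples the number of attention scores, which is why the summary treats memory and compute as separate bottlenecks.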

📝 Abstract

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.
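The abstract names two operations, query-aware pooling of semantic latent vectors and anchor selection for positional keys, but does not spell out their exact form. The sketch below is one plausible reading under stated assumptions, not the authors' method: the context is split into fixed-size blocks, the pooling weights are a softmax over query-latent dot products within each block, and the anchor is the positional key of the highest-weight token. The function name `condense`, the fixed `block_size`, and the use of a single query vector already expressed in the latent space are all hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def condense(query, latents, pos_keys, block_size):
    """Condense per-token (latent, positional key) pairs to one pair per block.

    Assumed rule (not from the paper): softmax weights over query-latent
    scores inside each block; latents are averaged under those weights, and
    the positional key of the highest-weight token is kept as the anchor.
    No learned parameters are involved.
    """
    pooled, anchors = [], []
    for start in range(0, len(latents), block_size):
        block = latents[start:start + block_size]
        weights = softmax([dot(query, c) for c in block])
        dim = len(block[0])
        pooled.append(
            [sum(w * c[d] for w, c in zip(weights, block)) for d in range(dim)]
        )
        best = max(range(len(block)), key=lambda i: weights[i])
        anchors.append(pos_keys[start + best])
    return pooled, anchors


# Toy input: 4 tokens, 2-dimensional latents, 1-dimensional positional keys.
query = [1.0, 0.0]
latents = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [3.0, 0.0]]
pos_keys = [[0.0], [1.0], [2.0], [3.0]]

pooled, anchors = condense(query, latents, pos_keys, block_size=2)
print(len(pooled), "condensed entries from", len(latents), "tokens")
print("anchor positional keys:", anchors)
```

In this toy setup a block size of 2 halves the number of cached entries; a block size of 10 would correspond to the 90% cache reduction reported in the abstract, assuming the reduction comes entirely from this kind of condensation.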

Problem

Research questions and friction points this paper is trying to address.

long context modeling

KV cache

self-attention complexity

latent space compression

efficient attention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent-Condensed Attention

KV cache reduction

long context modeling