Kimi Linear: An Expressive, Efficient Attention Architecture

📅 2025-10-30
🤖 AI Summary
Existing attention mechanisms struggle to consistently outperform full attention across short-context, long-context, and reinforcement learning scenarios, and they underuse finite-state RNN memory. This paper introduces Kimi Linear, a hybrid linear attention architecture that is the first to surpass full attention in all three settings under fair comparisons. The approach rests on three contributions: (1) Kimi Delta Attention (KDA), a variant of Gated DeltaNet with a finer-grained gating mechanism that makes better use of limited recurrent state; (2) a hardware-efficient chunkwise algorithm built on a specialized diagonal-plus-low-rank (DPLR) transition-matrix form, which reduces computation relative to the general DPLR formulation while staying consistent with the classical delta rule; and (3) a layerwise hybrid architecture that interleaves KDA with multi-head latent attention (MLA), cutting KV cache size by up to 75% and raising decoding throughput by up to 6x on million-token contexts. The KDA kernel, vLLM integration, and model checkpoints are publicly released.
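To make the KDA idea concrete, here is a minimal sketch of a delta-rule recurrence with per-token, per-channel decay gates. This is an illustrative reference implementation based on the summary above, not the paper's kernel: the exact placement of the gates, the head layout, and the function name `kda_recurrence` are assumptions, and the "finer-grained" gating is interpreted here as a decay vector per token (versus a single scalar per token in Gated DeltaNet).

```python
import numpy as np

def kda_recurrence(q, k, v, alpha, beta):
    """Naive per-token reference for a fine-grained gated delta rule.

    Shapes (illustrative, single head):
      q, k:  (T, d_k)
      v:     (T, d_v)
      alpha: (T, d_k) per-channel decay gates in (0, 1) -- the assumed
             "finer-grained" gating (a scalar per token would be coarser)
      beta:  (T,)     delta-rule write strengths in (0, 1)
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))           # finite-state RNN memory
    out = np.empty((T, d_v))
    for t in range(T):
        S = alpha[t][:, None] * S      # channel-wise forget (diagonal transition)
        pred = k[t] @ S                # memory's current prediction for key k_t
        S += beta[t] * np.outer(k[t], v[t] - pred)  # delta-rule correction
        out[t] = q[t] @ S              # linear-attention readout
    return out
```

With `alpha = 1` and `beta = 1`, writing two values under the same unit-norm key leaves only the second retrievable, which is the classical delta-rule overwrite behavior; the gates then let the model forget individual state channels at different rates.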

📝 Abstract
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6x higher decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
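The 75% KV-cache figure follows from the layerwise hybrid: only the MLA layers keep a per-token KV cache, while KDA layers carry a constant-size recurrent state. A hedged back-of-envelope check, assuming one MLA layer per four layers (the mixing ratio and per-layer byte count below are illustrative assumptions, not numbers from the abstract):

```python
# Back-of-envelope: KV cache of a layerwise KDA/MLA hybrid vs. a full-MLA
# stack. Only MLA layers cache keys/values; KDA layers hold O(1) state.
def kv_cache_bytes(n_layers, mla_every, seq_len, bytes_per_token_layer):
    mla_layers = n_layers // mla_every   # layers that actually keep a KV cache
    return mla_layers * seq_len * bytes_per_token_layer

# Hypothetical 48-layer model at a 1M-token context, 1 KiB KV per token-layer.
full = kv_cache_bytes(48, mla_every=1, seq_len=1_000_000, bytes_per_token_layer=1024)
hybrid = kv_cache_bytes(48, mla_every=4, seq_len=1_000_000, bytes_per_token_layer=1024)
print(hybrid / full)  # 0.25 -> a 75% reduction under a 3:1 KDA:MLA mix
```

Under these assumptions the hybrid keeps 25% of the full-attention KV cache, matching the "up to 75%" reduction claimed in the abstract; the throughput gain at 1M tokens comes from the same shrinkage of memory traffic during decoding.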
Problem

Research questions and friction points this paper is trying to address.

Existing attention mechanisms fail to consistently outperform full attention across short-context, long-context, and RL scaling regimes
Linear attention underuses its limited finite-state RNN memory without a finer-grained gating mechanism
Full-attention KV caches grow with context length, limiting decoding throughput on long sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layerwise hybrid of KDA and MLA outperforms full attention under an identical training recipe
KDA extends Gated DeltaNet with a finer-grained gating mechanism
Hardware-efficient chunkwise algorithm via a specialized variant of DPLR transition matrices
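The DPLR point above can be illustrated numerically. A general DPLR transition is `Diag(d) + a bᵀ` with free low-rank factors; the summary suggests KDA ties the low-rank part to the gated key, which is what makes the chunkwise form cheaper. The exact transition in the paper may differ; the sketch below only verifies the algebra of one plausible form, `A_t = (I − βₜ kₜ kₜᵀ) Diag(αₜ)`.

```python
import numpy as np

# One assumed KDA-style per-token transition, written two equivalent ways:
#   A = (I - beta * k k^T) Diag(alpha)           (delta rule with decay)
#   A = Diag(alpha) + u v^T,  u = -beta*k, v = alpha*k   (constrained DPLR)
# The low-rank factors are tied to the (gated) key rather than free, which is
# the structural restriction that the summary credits with reducing compute.
d_k = 4
rng = np.random.default_rng(0)
alpha = rng.uniform(0.9, 1.0, d_k)        # per-channel decay gates
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                     # unit-norm key
beta = 0.5                                 # write strength

A_delta = (np.eye(d_k) - beta * np.outer(k, k)) @ np.diag(alpha)
u = -beta * k
v = alpha * k
A_dplr = np.diag(alpha) + np.outer(u, v)
print(np.allclose(A_delta, A_dplr))  # True: a DPLR matrix with tied factors
```

Because `u` and `v` are both derived from the same key, a chunkwise implementation never has to materialize two independent low-rank streams, which is one way the specialized form can cut computation versus general DPLR.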