🤖 AI Summary
Existing MLA decoding kernels (e.g., FlashMLA) rely on the absorb method to reduce HBM bandwidth consumption, but their compute-bound nature limits the acceleration available from data reuse, particularly in shared-prefix scenarios. This work proposes a hybrid attention kernel that integrates the naive and absorb computation paradigms: it adaptively switches between them under shared-prefix conditions, jointly optimizing computational efficiency and memory bandwidth utilization. Built on Multi-Head Latent Attention, the kernel combines FlashAttention's and FlashMLA's memory access strategies to increase data reuse. Experiments demonstrate up to 3.0x and 3.24x attention throughput improvements on NPU and GPU platforms, respectively, with only a 3% overhead in HBM capacity. The core contribution is breaking the compute-bound constraint inherent to the absorb paradigm, enabling co-optimization of memory bandwidth and computational throughput.
📝 Abstract
Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA admits two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of absorb implementations precludes performance benefits from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines the naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of the attention calculation, while reducing the bandwidth requirements of the non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x and 3.24x on NPUs and GPUs, respectively, with only a 3% overhead in HBM size.
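The split described above — one attention path over the shared prefix and another over each request's non-shared suffix — relies on the fact that two attention partials over disjoint key ranges can be recombined exactly via the log-sum-exp merge used by FlashAttention-style kernels. The NumPy sketch below illustrates only that numerical merge, not TyphoonMLA itself: the naive/absorb kernel details (latent-projection absorption, tiling, HBM layout) are not modeled, and all function names are illustrative.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over one key range; returns unnormalized output,
    running row-max, and softmax denominator (flash-attention stats)."""
    s = q @ k.T / np.sqrt(q.shape[-1])        # (1, n) scores
    m = s.max(axis=-1, keepdims=True)         # running max for stability
    p = np.exp(s - m)
    l = p.sum(axis=-1, keepdims=True)         # softmax denominator
    return p @ v, m, l                        # unnormalized output, stats

def merge(o1, m1, l1, o2, m2, l2):
    """Exact log-sum-exp merge of two attention partials."""
    m = np.maximum(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)   # rescale each partial
    return (a1 * o1 + a2 * o2) / (a1 * l1 + a2 * l2)

rng = np.random.default_rng(0)
d, n_prefix, n_suffix = 64, 128, 32
q = rng.standard_normal((1, d))
k = rng.standard_normal((n_prefix + n_suffix, d))
v = rng.standard_normal((n_prefix + n_suffix, d))

# In TyphoonMLA terms: one path would handle the shared prefix, the
# other the per-request suffix; here both use the same dense math.
o1, m1, l1 = partial_attention(q, k[:n_prefix], v[:n_prefix])
o2, m2, l2 = partial_attention(q, k[n_prefix:], v[n_prefix:])
hybrid = merge(o1, m1, l1, o2, m2, l2)

# Reference: softmax attention over the full key range in one pass.
s = q @ k.T / np.sqrt(d)
p = np.exp(s - s.max())
ref = (p / p.sum()) @ v
assert np.allclose(hybrid, ref)
```

Because the merge is exact, the two paths are free to use different compute strategies (a compute-dense naive kernel on the prefix, a bandwidth-lean absorb kernel on the suffix) without changing the attention output.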