🤖 AI Summary
Existing MLA decoding kernels (e.g., FlashMLA) rely on the absorb method to reduce HBM bandwidth consumption, but their compute-bound nature limits the acceleration available from data reuse, particularly in shared-prefix scenarios. This work proposes a hybrid attention kernel that integrates the naive and absorb computation paradigms: it adaptively switches between them under shared-prefix conditions, jointly optimizing computational efficiency and memory bandwidth utilization. Built on Multi-Head Latent Attention, the kernel combines FlashAttention's and FlashMLA's memory access strategies to increase data reuse. Experiments demonstrate up to 3.0x and 3.24x attention throughput improvements on NPU and GPU platforms, respectively, with only a 3% overhead in HBM capacity. The core contribution is breaking the compute-bound constraint inherent to the absorb paradigm, enabling co-optimization of memory bandwidth and computational throughput.
📝 Abstract
Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA admits two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of absorb implementations precludes performance benefits from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines the naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of the attention calculation, while reducing the bandwidth requirements of the non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x and 3.24x on NPUs and GPUs, respectively, with only a 3% overhead in HBM size.
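The split described above — one attention path over the shared prefix and another over each request's non-shared suffix — relies on the fact that two attention partials over disjoint key ranges can be recombined exactly via the log-sum-exp merge used by FlashAttention-style kernels. The NumPy sketch below illustrates only that numerical merge, not TyphoonMLA itself: the naive/absorb kernel details (latent-projection absorption, tiling, HBM layout) are not modeled, and all function names are illustrative.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over one key range; returns unnormalized output,
    running row-max, and softmax denominator (flash-attention stats)."""
    s = q @ k.T / np.sqrt(q.shape[-1])        # (1, n) scores
    m = s.max(axis=-1, keepdims=True)         # running max for stability
    p = np.exp(s - m)
    l = p.sum(axis=-1, keepdims=True)         # softmax denominator
    return p @ v, m, l                        # unnormalized output, stats

def merge(o1, m1, l1, o2, m2, l2):
    """Exact log-sum-exp merge of two attention partials."""
    m = np.maximum(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)   # rescale each partial
    return (a1 * o1 + a2 * o2) / (a1 * l1 + a2 * l2)

rng = np.random.default_rng(0)
d, n_prefix, n_suffix = 64, 128, 32
q = rng.standard_normal((1, d))
k = rng.standard_normal((n_prefix + n_suffix, d))
v = rng.standard_normal((n_prefix + n_suffix, d))

# In TyphoonMLA terms: one path would handle the shared prefix, the
# other the per-request suffix; here both use the same dense math.
o1, m1, l1 = partial_attention(q, k[:n_prefix], v[:n_prefix])
o2, m2, l2 = partial_attention(q, k[n_prefix:], v[n_prefix:])
hybrid = merge(o1, m1, l1, o2, m2, l2)

# Reference: softmax attention over the full key range in one pass.
s = q @ k.T / np.sqrt(d)
p = np.exp(s - s.max())
ref = (p / p.sum()) @ v
assert np.allclose(hybrid, ref)
```

Because the merge is exact, the two paths are free to use different compute strategies (a compute-dense naive kernel on the prefix, a bandwidth-lean absorb kernel on the suffix) without changing the attention output.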