AMLA: MUL by ADD in FlashAttention Rescaling

📅 2025-09-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational overhead and intermediate tensor bloat that Multi-head Latent Attention (MLA) induces during large language model decoding, this paper proposes AMLA, a high-performance kernel optimized for Huawei’s Ascend NPUs. AMLA builds on the FlashAttention framework and replaces floating-point multiplications with integer additions for output block rescaling, exploiting the binary mapping between FP32 and INT32 representations; it further introduces a preload pipeline coupled with hierarchical tiling to deeply overlap Cube-core computation with data movement. Evaluated on the Ascend 910 NPU, AMLA achieves 614 TFLOPS at 86.8% FLOPS utilization, significantly surpassing FlashMLA’s 66.7% utilization on the NVIDIA H800 GPU. The kernel has been integrated into Huawei’s CANN software ecosystem and is scheduled for open-source release.
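
For context, the "rescaling" in the title is the per-row correction step of FlashAttention's online softmax. In the standard formulation (generic FlashAttention notation, not equations reproduced from this paper), processing a new key/value block with score tile S_i updates the running maximum m, normalizer ℓ, and output O as follows; the factor e^{m_old − m_new} multiplies every element of O_old, and these are the floating-point multiplications AMLA converts into integer additions:

```latex
% Standard FlashAttention online-softmax update (generic notation,
% not copied from this paper).
\[
\begin{aligned}
m_{\mathrm{new}}    &= \max\!\bigl(m_{\mathrm{old}},\ \operatorname{rowmax}(S_i)\bigr),\\
\ell_{\mathrm{new}} &= e^{\,m_{\mathrm{old}} - m_{\mathrm{new}}}\,\ell_{\mathrm{old}}
                       + \operatorname{rowsum}\!\bigl(e^{\,S_i - m_{\mathrm{new}}}\bigr),\\
O_{\mathrm{new}}    &= e^{\,m_{\mathrm{old}} - m_{\mathrm{new}}}\,O_{\mathrm{old}}
                       + e^{\,S_i - m_{\mathrm{new}}}\,V_i .
\end{aligned}
\]
```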

📝 Abstract
Multi-head Latent Attention (MLA) significantly reduces KVCache memory usage in Large Language Models while introducing substantial computational overhead and intermediate variable expansion. This poses challenges for efficient hardware implementation -- especially during the decode phase. This paper introduces Ascend MLA (AMLA), a high-performance kernel specifically optimized for Huawei's Ascend NPUs. AMLA is built on two core innovations: (1) A novel FlashAttention-based algorithm that replaces floating-point multiplications with integer additions for output block rescaling, leveraging binary correspondence between FP32 and INT32 representations; (2) A Preload Pipeline strategy with hierarchical tiling that maximizes FLOPS utilization: the Preload Pipeline achieves Cube-bound performance, while hierarchical tiling overlaps data movement and computation within the Cube core. Experiments show that on Ascend 910 NPUs (integrated in CloudMatrix384), AMLA achieves up to 614 TFLOPS, reaching 86.8% of the theoretical maximum FLOPS, outperforming the state-of-the-art open-source FlashMLA implementation, whose FLOPS utilization is up to 66.7% on NVIDIA H800 SXM5. The AMLA kernel has been integrated into Huawei's CANN and will be released soon.
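
The "binary correspondence between FP32 and INT32" mentioned in the abstract is the IEEE-754 property that adding k to the 8-bit exponent field of a normal FP32 value scales it by 2^k, so a multiply by a power of two can be done as a single INT32 addition on the bit pattern. Below is a minimal sketch of that correspondence only; the paper's kernel applies it to the rescaling factors on Ascend hardware, and all function names here are hypothetical:

```python
import struct

def f32_to_u32(x: float) -> int:
    """Reinterpret an FP32 value's bits as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def u32_to_f32(u: int) -> float:
    """Reinterpret a 32-bit integer's bits as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", u & 0xFFFFFFFF))[0]

def mul_pow2_by_add(x: float, k: int) -> float:
    """Multiply a normal FP32 value by 2**k with one integer addition.

    Adding k to the exponent field (bits 23..30) of the IEEE-754
    representation scales the value by 2**k, as long as the result
    stays in the normal range (no overflow to inf, no underflow to
    denormals).
    """
    return u32_to_f32(f32_to_u32(x) + (k << 23))

if __name__ == "__main__":
    x = u32_to_f32(f32_to_u32(1.7))  # round 1.7 to its nearest FP32 value first
    for k in (-3, -1, 2, 5):
        assert mul_pow2_by_add(x, k) == x * 2.0 ** k
    print("exponent-field addition matches FP32 multiply-by-2**k exactly")
```

The sketch covers only the exact power-of-two case, where the INT32 addition is lossless; a general rescaling factor e^{Δm} would additionally need its non-integer log2 component handled, which is the part the paper's algorithm addresses.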
Problem

Research questions and friction points this paper is trying to address.

MLA reduces KVCache memory usage in Large Language Models but introduces substantial computational overhead
Intermediate variable expansion poses challenges for efficient hardware implementation
These challenges are most acute during the decode phase
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces floating-point multiplications with integer additions for output block rescaling, exploiting the FP32/INT32 binary correspondence
Uses a Preload Pipeline to reach Cube-bound performance (see the sketch after this list)
Employs hierarchical tiling to overlap data movement with computation inside the Cube core
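
The Preload Pipeline itself is Ascend-specific, but the underlying pattern is classic double buffering: issue the copy of tile i+1 while tile i is being consumed. The schematic below shows only that scheduling order (plain Python executes it sequentially; on an NPU the load would be an asynchronous DMA that genuinely overlaps the compute; names and structure are illustrative, not Huawei's implementation):

```python
def preload_pipeline(tiles, load, compute):
    """Double-buffered loop: fetch the next tile while the current one
    is being consumed. Schematic of the software-pipelining idea only.
    """
    if not tiles:
        return
    bufs = [load(tiles[0]), None]  # prologue: bring in the first tile
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            # On real hardware this copy is asynchronous, so it runs
            # concurrently with the compute() call below.
            bufs[(i + 1) % 2] = load(tiles[i + 1])
        compute(bufs[i % 2])

# Usage: preload_pipeline(kv_blocks, dma_copy_to_l1, cube_matmul_block),
# where dma_copy_to_l1 and cube_matmul_block are hypothetical callables.
```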
Authors
Qichen Liao (Huawei)
Chengqiu Hu (Huawei)
Fangzheng Miao (Huawei)
Bao Li (Huawei)
Yiyang Liu (University of Missouri - Kansas City)
Junlong Lyu (Huawei)
Lirui Jiang (Huawei)
Jun Wang (Huawei)
Lingchao Zheng (Huawei)
Jun Li (Huawei)
Yuwei Fan (Huawei)