🤖 AI Summary
This work addresses the challenge of efficiently deploying W4A16 (4-bit weights, 16-bit activations) mixed-precision matrix multiplication on the Ascend 910 NPU, where limited native support and a decoupled architecture induce severe memory bottlenecks. We propose the first practical W4A16 GEMM kernel tailored for this architecture, co-optimizing computation and memory access through on-the-fly INT4-to-FP16 dequantization in vector cores, high-throughput computation in cube cores, and a Split-K parallelization strategy. Our analysis reveals that memory transfer—not dequantization—is the primary performance bottleneck. In typical LLM decoding scenarios (where K ≫ N), the proposed kernel achieves speedups of 1.01–1.74× over data-parallel baselines and up to 1.48× over native FP16 GEMM, establishing a new paradigm for efficient deployment of quantized large language models on dedicated accelerators.
📝 Abstract
As Large Language Models (LLMs) scale, weight-only quantization (W4A16: 4-bit weights, 16-bit activations) becomes critical for reducing memory footprint with minimal accuracy loss. However, its efficient deployment on Huawei's Ascend 910 Neural Processing Unit (NPU) is challenging due to limited native mixed-precision support and the accelerator's decoupled compute architecture. To enable quantization on such an architecture, we present the first practical W4A16 matrix multiplication kernel tailored for the Ascend 910 NPU. Our design leverages vector cores for on-the-fly INT4-to-FP16 dequantization, cube cores for high-throughput GEMM, and Split-K parallelization to mitigate memory latency. Performance evaluations across diverse matrix shapes and batch sizes show that our method outperforms data-parallel approaches when K >> N, a typical scenario in LLM decoding. Specifically, our method achieves speedups ranging from 1.01x to 1.74x. In addition, our profiling reveals that the primary bottleneck is not the dequantization computation itself but the extra global memory transfers for the weights, limiting W4A16 to a maximum speedup of 1.48x over native FP16xFP16 matrix multiplication in PyTorch. In the long run, our method lays a solid foundation and offers insights for the efficient deployment of quantized large language models on various domain-specific accelerators.
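The general W4A16 scheme the abstract describes (packed INT4 weights, on-the-fly dequantization to FP16, and a Split-K reduction over partial products) can be sketched numerically. The NumPy sketch below is an illustrative reference model only, not the Ascend kernel: the function names, per-output-channel scale layout, and the choice of four K-splits are assumptions for demonstration, and the loop over K-splits stands in for what would be parallel cores with a final reduction.

```python
import numpy as np

def pack_int4(w):
    """Pack signed INT4 weights (values in [-8, 7]) two per byte along the last axis."""
    u = (w.astype(np.int8) & 0x0F).astype(np.uint8)
    return (u[..., 0::2] | (u[..., 1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Unpack bytes back into signed INT4 values stored as int8."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    # sign-extend the 4-bit values
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out

def w4a16_gemm_split_k(x_fp16, w_packed, scales, num_splits=4):
    """W4A16 GEMM reference model: dequantize INT4 weights to FP16 on the fly,
    then accumulate partial products over K-splits (Split-K parallelization)."""
    K = x_fp16.shape[1]
    w_int4 = unpack_int4(w_packed)                  # (K, N) signed INT4 in int8
    w_fp16 = w_int4.astype(np.float16) * scales     # per-output-channel scales, shape (N,)
    acc = np.zeros((x_fp16.shape[0], w_fp16.shape[1]), dtype=np.float32)
    for k_idx in np.array_split(np.arange(K), num_splits):
        # each split would run on its own core; partial sums are reduced at the end
        acc += x_fp16[:, k_idx].astype(np.float32) @ w_fp16[k_idx].astype(np.float32)
    return acc.astype(np.float16)
```

Note that the model reads the full weight tile before the matmul, which mirrors the bottleneck the abstract identifies: the cost is dominated by moving weights through memory, not by the dequantization arithmetic itself.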