Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

📅 2026-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large language model inference, dequantizing low-bit weights becomes a performance bottleneck on decoupled-architecture accelerators and leaves the tensor cores underutilized. This work proposes Multi-Scale Dequant (MSD), a framework that decomposes high-precision activations into multiple low-precision components, each of which is multiplied directly with quantized weights in a hardware-native GEMM. Dequantization is thereby removed from the critical path and reformulated as multi-scale approximate computation. MSD is presented as the first approach to eliminate dequantization overhead for both weights and the KV cache through activation decomposition. It introduces dual-path designs for INT8 and MXFP4 that reach nearly 16 and nearly 6.6 effective bits of precision, respectively, at identical GEMM latency, with derived error bounds. In numerical simulations of matrix multiplication and attention kernels, MSD matches the accuracy of dequantization baselines and in many settings achieves lower L2 error. The accompanying latency and HBM traffic models show that it avoids Vector-Cube pipeline stalls and cuts KV cache HBM traffic by up to 2.5×.
📝 Abstract
Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step, which converts low-bit weights back to high precision for matrix multiplication, has become a critical bottleneck on modern AI accelerators. On architectures with decoupled compute units (e.g., Ascend NPUs), dequantization operations can consume more cycles than the matrix multiplication itself, leaving the high-throughput tensor cores underutilized. This paper presents Multi-Scale Dequant (MSD), a quantization framework that removes weight and KV cache dequantization from the GEMM critical path. Instead of lifting low-bit weights to BF16 precision, MSD decomposes high-precision BF16 activations into multiple low-precision components, each of which can be multiplied directly with quantized weights via native hardware-accelerated GEMM. This approach shifts the computational paradigm from precision conversion to multi-scale approximation, avoiding INT8-to-BF16 weight conversion before GEMM. We instantiate MSD for two weight formats and derive tight error bounds for each. For INT8 weights (W8A16), two-pass INT8 decomposition achieves nearly 16 effective bits. For MXFP4 weights (W4A16), two-pass MXFP4 decomposition yields nearly 6.6 effective bits with an error bound of 1/64 per block, surpassing single-pass MXFP8 (5.24 bits) while maintaining the same effective GEMM compute time. We further derive closed-form latency and HBM traffic models showing that MSD avoids the Vector-Cube pipeline stall caused by dequantization and reduces KV cache HBM traffic by up to 2.5× in attention. Numerical simulations on matrix multiplication and Flash Attention kernels confirm that MSD does not degrade accuracy compared to dequantization baselines, and in many settings achieves lower L2 error.
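The two-pass INT8 idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes symmetric per-tensor scales, round-to-nearest quantization, and a second pass that quantizes the residual of the first. The function names (`decompose_two_pass`, `msd_matvec`, and so on) are hypothetical, and a pure-Python integer matrix-vector product stands in for the hardware INT8 GEMM. The point is only that the weights stay in INT8 throughout, and the floating-point scales are applied once to the integer accumulators.

```python
import random

def quantize_int8(values, scale):
    """Symmetric round-to-nearest quantization into [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def decompose_two_pass(x):
    """Split a float vector into two INT8 parts so that x ~= s1*q1 + s2*q2.

    Assumption: per-tensor scales; the second pass quantizes the residual
    left over by the first pass.
    """
    s1 = max(abs(v) for v in x) / 127 or 1.0   # fall back to 1.0 for all-zero input
    q1 = quantize_int8(x, s1)
    residual = [v - s1 * q for v, q in zip(x, q1)]
    s2 = max(abs(r) for r in residual) / 127 or 1.0
    q2 = quantize_int8(residual, s2)
    return (s1, q1), (s2, q2)

def int_matvec(w_q, q):
    """Integer-only matrix-vector product (stand-in for a native INT8 GEMM)."""
    return [sum(w * a for w, a in zip(row, q)) for row in w_q]

def msd_matvec(w_q, w_scale, x):
    """Multiply INT8 weights by a float vector without dequantizing the weights."""
    (s1, q1), (s2, q2) = decompose_two_pass(x)
    acc1 = int_matvec(w_q, q1)   # coarse component
    acc2 = int_matvec(w_q, q2)   # fine (residual) component
    # Scales are applied to the integer accumulators, not to the weights.
    return [w_scale * (s1 * a + s2 * b) for a, b in zip(acc1, acc2)]

def one_pass_matvec(w_q, w_scale, x):
    """Same scheme with only the coarse INT8 component, for comparison."""
    s1 = max(abs(v) for v in x) / 127 or 1.0
    acc = int_matvec(w_q, quantize_int8(x, s1))
    return [w_scale * s1 * a for a in acc]

def reference_matvec(w_q, w_scale, x):
    """Baseline: dequantize each weight to float, then multiply."""
    return [sum(w_scale * w * v for w, v in zip(row, x)) for row in w_q]

def l2_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

# Small demo with made-up data: 4 output rows, 64 input features.
random.seed(0)
w_q = [[random.randint(-127, 127) for _ in range(64)] for _ in range(4)]
w_scale = 0.01
x = [random.gauss(0.0, 1.0) for _ in range(64)]

y_ref = reference_matvec(w_q, w_scale, x)
print("one-pass L2 error:", l2_error(one_pass_matvec(w_q, w_scale, x), y_ref))
print("two-pass L2 error:", l2_error(msd_matvec(w_q, w_scale, x), y_ref))
```

Under these assumptions the first-pass residual is at most `s1 / 2` per element, so the second scale satisfies `s2 <= s1 / 254` and the reconstruction error per element is at most `s2 / 2`. That is roughly eight additional bits of activation precision for one extra integer GEMM, which is consistent with the "nearly 16 effective bits" figure quoted above, although the paper's exact decomposition and bound may differ.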
Problem

Research questions and friction points this paper is trying to address.

dequantization bottleneck
efficient LLM inference
quantization
GEMM critical path
AI accelerators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Scale Dequant
Activation Decomposition
Dequantization Bottleneck
Efficient LLM Inference
Low-Precision GEMM