🤖 AI Summary
Large language models (LLMs) face significant challenges in GPU deployment, including high memory footprints, low inference throughput, and substantial accuracy degradation under aggressive low-bit quantization. To address these issues, this paper proposes a hardware-cooperative, layer-adaptive non-uniform quantization method. Designed with explicit GPU architecture awareness, it employs weight-distribution-aware binning and lookup-table (LUT)-driven mixed-precision GEMM, enabling training-free post-training quantization while avoiding conventional uniform quantization and inefficient dequantization paths. The approach introduces a quantization paradigm that is layer-customized, hardware-native, and LUT-accelerated. Experimental results demonstrate significantly reduced perplexity gaps at 3- and 4-bit precision and a 2.57× inference speedup on a single RTX 4090 GPU. The method also surpasses state-of-the-art approaches in both memory efficiency and quantization accuracy.
📝 Abstract
Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup-table-based mpGEMM. GANQ achieves superior quantization performance by using a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization error. Extensive experiments demonstrate that GANQ narrows the perplexity gap to the FP16 baseline relative to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57× speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
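To make the core idea concrete, the sketch below illustrates lookup-table-based non-uniform quantization in NumPy: each weight row is mapped to indices into a small per-row codebook chosen from the row's weight distribution (here, simple quantiles stand in for the paper's optimized centroids), and the matrix-vector product is computed from table lookups rather than a uniform dequantization grid. All function names are illustrative assumptions, not GANQ's actual API.

```python
import numpy as np

def quantize_nonuniform(W, bits=4):
    """Per-row non-uniform quantization: choose centroids from each row's
    weight distribution (quantiles here, as a stand-in for learned bins)
    and store 4-bit indices plus a per-row FP16 lookup table (LUT)."""
    n_levels = 2 ** bits
    rows, cols = W.shape
    luts = np.empty((rows, n_levels), dtype=np.float16)
    idx = np.empty((rows, cols), dtype=np.uint8)
    for r in range(rows):
        # Distribution-aware bins: quantiles of this row's weights.
        centroids = np.quantile(W[r], np.linspace(0.0, 1.0, n_levels))
        luts[r] = centroids.astype(np.float16)
        # Assign each weight to its nearest centroid index.
        idx[r] = np.abs(W[r][:, None] - centroids[None, :]).argmin(axis=1)
    return idx, luts

def lut_gemv(idx, luts, x):
    """y = W x via table lookup: reconstruct each row's weights from its
    LUT instead of dequantizing a uniform grid."""
    W_hat = np.take_along_axis(luts.astype(np.float32),
                               idx.astype(np.int64), axis=1)
    return W_hat @ x

W = np.random.randn(8, 64).astype(np.float32)
x = np.random.randn(64).astype(np.float32)
idx, luts = quantize_nonuniform(W)
y = lut_gemv(idx, luts, x)
```

On a GPU, the lookup tables are small enough to live in fast shared memory, which is what makes this layout attractive for mpGEMM kernels; the NumPy version above only shows the numerics, not the kernel-level layout.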