🤖 AI Summary
Large language models (LLMs) face significant challenges in GPU deployment, including high memory footprints, low inference throughput, and substantial accuracy degradation under aggressive low-bit quantization. To address these issues, this paper proposes a hardware-cooperative, layer-adaptive non-uniform quantization method. Designed with explicit GPU architecture awareness, it employs weight-distribution-aware binning and lookup-table (LUT)-driven mixed-precision GEMM, enabling training-free post-training quantization while avoiding conventional uniform quantization and inefficient dequantization paths. The approach introduces a quantization paradigm that is layer-customized, hardware-native, and LUT-accelerated. Experimental results demonstrate significantly reduced perplexity gaps at 3- and 4-bit precision and a 2.57× inference speedup on a single RTX 4090 GPU. The method also surpasses state-of-the-art approaches in both memory efficiency and quantization accuracy.
📝 Abstract
Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup-table-based mpGEMM. GANQ achieves superior quantization performance by using a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization error. Extensive experiments demonstrate that GANQ narrows the perplexity gap to the FP16 baseline relative to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57× speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
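To make the core idea concrete, the sketch below illustrates lookup-table-based non-uniform quantization in NumPy: each weight row is mapped to indices into a small per-row codebook chosen from the row's weight distribution (here, simple quantiles stand in for the paper's optimized centroids), and the matrix-vector product is computed from table lookups rather than a uniform dequantization grid. All function names are illustrative assumptions, not GANQ's actual API.

```python
import numpy as np

def quantize_nonuniform(W, bits=4):
    """Per-row non-uniform quantization: choose centroids from each row's
    weight distribution (quantiles here, as a stand-in for learned bins)
    and store 4-bit indices plus a per-row FP16 lookup table (LUT)."""
    n_levels = 2 ** bits
    rows, cols = W.shape
    luts = np.empty((rows, n_levels), dtype=np.float16)
    idx = np.empty((rows, cols), dtype=np.uint8)
    for r in range(rows):
        # Distribution-aware bins: quantiles of this row's weights.
        centroids = np.quantile(W[r], np.linspace(0.0, 1.0, n_levels))
        luts[r] = centroids.astype(np.float16)
        # Assign each weight to its nearest centroid index.
        idx[r] = np.abs(W[r][:, None] - centroids[None, :]).argmin(axis=1)
    return idx, luts

def lut_gemv(idx, luts, x):
    """y = W x via table lookup: reconstruct each row's weights from its
    LUT instead of dequantizing a uniform grid."""
    W_hat = np.take_along_axis(luts.astype(np.float32),
                               idx.astype(np.int64), axis=1)
    return W_hat @ x

W = np.random.randn(8, 64).astype(np.float32)
x = np.random.randn(64).astype(np.float32)
idx, luts = quantize_nonuniform(W)
y = lut_gemv(idx, luts, x)
```

On a GPU, the lookup tables are small enough to live in fast shared memory, which is what makes this layout attractive for mpGEMM kernels; the NumPy version above only shows the numerics, not the kernel-level layout.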