GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference

📅 2024-12-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of simultaneously minimizing memory footprint, computational overhead, and accuracy degradation in large language model (LLM) inference under high compression ratios, this paper proposes GQSA—a holistic algorithm-system co-designed framework integrating grouped quantization and structured sparsity. Its key contributions are: (1) a novel task-centric sparse parallelism strategy that improves GPU utilization; (2) a two-stage sparse optimization method enabling tight coupling between sparsity and 4-bit weight quantization (W4); and (3) flexible, tunable sparsity ratios and higher weight compression ratios. Under the W4S50% configuration (4-bit weights with 50% sparsity), GQSA achieves higher accuracy than both 2:4 pruning and W2 quantization, while delivering 1.26× faster inference than W2 and 2.35× faster than 2:4 pruning—significantly surpassing the performance limits of single-compression paradigms.
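The core idea described above, pruning whole weight groups and quantizing the survivors to 4 bits, can be sketched as follows. This is a minimal illustration, not the paper's actual method: the function name `gqsa_sketch`, the group size of 16, and the L2-norm group-selection rule are all illustrative assumptions; the paper's two-stage sparse optimization is considerably more involved.

```python
import numpy as np

def gqsa_sketch(w, group_size=16, sparsity=0.5, bits=4):
    """Illustrative sketch (not the paper's algorithm): drop the weakest
    weight groups by L2 norm, then apply per-group asymmetric int4
    quantization to the surviving groups."""
    groups = w.reshape(-1, group_size)
    norms = np.linalg.norm(groups, axis=1)
    k = int(len(groups) * sparsity)        # number of groups to prune
    drop = np.argsort(norms)[:k]           # weakest groups are removed
    keep_mask = np.ones(len(groups), dtype=bool)
    keep_mask[drop] = False

    qmax = 2**bits - 1                     # 15 for 4-bit codes
    out = np.zeros_like(groups)            # pruned groups stay at zero
    for i in np.nonzero(keep_mask)[0]:
        g = groups[i]
        lo, hi = g.min(), g.max()
        scale = (hi - lo) / qmax if hi > lo else 1.0
        q = np.clip(np.round((g - lo) / scale), 0, qmax)  # int4 codes
        out[i] = q * scale + lo                           # dequantize
    return out.reshape(w.shape), keep_mask
```

Because pruning operates on contiguous groups rather than individual weights, the surviving codes and per-group scales stay in a dense, GPU-friendly layout, which is the structural property the paper's kernels exploit.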

📝 Abstract
Model compression has emerged as a mainstream solution to reduce memory usage and computational overhead. This paper presents Group Quantization and Sparse Acceleration (GQSA), a novel compression technique tailored for LLMs. Traditional methods typically focus exclusively on either quantization or sparsification, but relying on a single strategy often results in significant performance loss at high compression rates. In contrast, GQSA integrates quantization and sparsification in a tightly coupled manner, leveraging GPU-friendly structured group sparsity and quantization for efficient acceleration. Building upon system-algorithm co-design principles, we propose a two-stage sparse optimization strategy that ensures the performance superiority of the compressed model. On the engine side, we introduce a "task-centric" parallel strategy, which, to the best of our knowledge, is the first application of such a strategy in the domain of sparse computing. Compared to the traditional 2:4 sparse method, GQSA offers a more flexible and adjustable sparsity rate, as well as a higher weight compression rate, and is efficiently compatible with weight-only quantization methods. Experimental results demonstrate that, under the GQSA W4S50% compression setting, the model's accuracy surpasses that of both 2:4 pruning and W2 quantization. Furthermore, at the inference level, GQSA outperforms W2 by 1.26× and 2:4 pruning by 2.35× in terms of speed.
Problem

Research questions and friction points this paper is trying to address.

Accelerates large language model inference
Reduces memory usage and computational overhead
Integrates quantization and sparsification techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Quantization and Sparse Acceleration
GPU-friendly structured group sparsity
Task-centric parallel strategy
Chao Zeng, ByteDance Inc
Songwei Liu, ByteDance Inc
Shu Yang, ByteDance Inc
Fangmin Chen, ByteDance Inc
Xing Mei, ByteDance Inc (Computer Vision, Computer Graphics, Image Processing)
Lean Fu, ByteDance Inc