LLM Compression: How Far Can We Go in Balancing Size and Performance?

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the feasibility and performance limits of 4-bit quantization for lightweight deployment of large language models (LLMs). Across three representative tasks, namely information retrieval, Boolean question answering, and mathematical reasoning, we systematically compare Group Scaling Quantization (GSQ) and GPTQ, two state-of-the-art 4-bit quantization methods, on the LLaMA, Qwen, and Phi model families. Evaluations are conducted on the MS MARCO, BoolQ, and GSM8K benchmarks. Our experiments reveal, for the first time, systematic trade-offs between quantization accuracy, model scale, and task type: under 4-bit quantization, models retain over 90% of their original accuracy while achieving ~75% memory reduction, 40–60% lower inference latency, and 1.8–2.3× higher throughput. These findings establish a reproducible quantization benchmark and provide practical guidelines for efficient LLM deployment.

📝 Abstract
Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and Phi 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on the MS MARCO (information retrieval), BoolQ (Boolean question answering), and GSM8K (mathematical reasoning) datasets, assessing both accuracy and efficiency. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), and provides insights into the suitability of low-bit quantization for real-world deployment. These results allow practitioners to choose a configuration that meets their deployment requirements. We discuss the pros and cons of GSQ and GPTQ on models of different sizes, which also serves as a benchmark for future experiments.
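To make the group-scaling idea concrete, here is a minimal sketch of 4-bit group quantization: weights are split into fixed-size groups, each group gets one floating-point scale, and values are rounded into the signed int4 range [-8, 7]. This is an illustrative assumption, not the paper's exact GSQ implementation; the group size and the symmetric scheme are choices made here for clarity.

```python
import numpy as np

def quantize_group_4bit(weights, group_size=128):
    """Quantize a 1-D weight vector to 4-bit integers, one scale per group.

    Illustrative sketch only: group size and the symmetric scheme
    are assumptions, not the paper's exact GSQ configuration.
    """
    pad = (-len(weights)) % group_size
    w = np.pad(weights.astype(np.float32), (0, pad))
    groups = w.reshape(-1, group_size)
    # Symmetric scheme: map each group's max magnitude onto the int4 range.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales, original_len):
    """Reconstruct float weights from int4 codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)[:original_len]

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_group_4bit(w)
w_hat = dequantize(q, s, len(w))
```

Storing 4-bit codes plus one scale per 128 weights is what yields the roughly 75% memory reduction relative to 16-bit weights reported in the summary; the per-group scale bounds the rounding error by half a quantization step within each group.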
Problem

Research questions and friction points this paper is trying to address.

Evaluate 4-bit quantization impact on LLM performance
Assess trade-offs between model compression and task accuracy
Compare GSQ and GPTQ techniques for real-world deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

4-bit Group Scaling Quantization for LLMs
Generative Pretrained Transformer Quantization technique
Evaluating accuracy and efficiency trade-offs
Sahil Sk
Odia Generative AI, India
Debasish Dhal
Odia Generative AI, India
Sonal Khosla
Odia Generative AI, India
Sk Shahid
Odia Generative AI, India
Sambit Shekhar
Odia Generative AI, India
Akash Dhaka
AMD Silo AI, Finland
Shantipriya Parida
AMD Silo AI, Finland
Dilip K. Prasad
Professor, UiT The Arctic University of Norway
Pattern Recognition, Artificial Intelligence, Computer Vision, Machine Learning
Ondřej Bojar
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
machine translation, speech translation, parsing, treebanking