Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant

📅 2024-09-17

📈 Citations: 3

✨ Influential: 1

🤖 AI Summary

This work systematically investigates the trade-offs among quantization methods, model scale (1B–405B), and task difficulty, with emphasis on how well higher-order capabilities, such as instruction following and hallucination detection, hold up from edge-sized to very large models. We evaluate four quantization schemes (AWQ, GPTQ, FP8, and INT4) on 13 diverse tasks spanning language understanding, reasoning, code generation, and STEM domains, complemented by an LLM-based judge (MT-Bench). Key findings are: (1) Accuracy loss under quantization does not simply track task difficulty; quantization amplifies a model's inherent weaknesses; (2) FP8 exhibits superior cross-task robustness, yielding an average MT-Bench score gain of +2.3 points; (3) 70B models suffer <1.5% accuracy loss under 4-bit quantization, whereas 1B models degrade by >12%; (4) Quantized large models generally outperform smaller FP16 models, yet instruction following and hallucination detection incur significant degradation.

📝 Abstract

Quantization has gained attention as a promising solution for the cost-effective deployment of large and small language models. However, most prior work has been limited to perplexity or basic knowledge tasks and lacks a comprehensive evaluation of recent models like Llama-3.3. In this paper, we conduct a comprehensive evaluation of instruction-tuned models spanning 1B to 405B parameters, applying four quantization methods across 13 datasets. Our findings reveal that (1) quantized models generally surpass smaller FP16 baselines, yet they often struggle with instruction-following and hallucination detection; (2) FP8 consistently emerges as the most robust option across tasks, and AWQ tends to outperform GPTQ in weight-only quantization; (3) smaller models can suffer severe accuracy drops at 4-bit quantization, while 70B-scale models maintain stable performance; (4) notably, "hard" tasks do not always experience the largest accuracy losses, indicating that quantization magnifies a model's inherent weaknesses rather than simply correlating with task difficulty; and (5) an LLM-based judge (MT-Bench) highlights significant performance declines in coding and STEM tasks, though reasoning may sometimes improve.
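To make the term "weight-only 4-bit quantization" concrete, the sketch below applies plain group-wise round-to-nearest quantization to a list of weights and measures the reconstruction error. This is an illustrative baseline only: it is not AWQ or GPTQ (both add calibration-based steps on top of this idea), and it is not the paper's evaluation code. The function names, the group size of 128, and the random test weights are all assumptions made for the example.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns integer codes plus the single scale shared by the group.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit, 127 for signed 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code, then clamp to the signed range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def fake_quantize(weights, bits=4, group_size=128):
    """Quantize then dequantize, one group at a time (hypothetical helper)."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale = quantize_group(group, bits)
        restored.extend(c * scale for c in codes)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


if __name__ == "__main__":
    random.seed(0)
    # Stand-in for a weight vector; real layers are far larger.
    weights = [random.gauss(0.0, 1.0) for _ in range(1024)]
    for bits in (8, 4):
        err = mean_abs_error(weights, fake_quantize(weights, bits=bits))
        print(f"{bits}-bit mean absolute error: {err:.4f}")
```

The reconstruction error grows as the bit width shrinks, which is the raw effect that methods such as AWQ and GPTQ try to contain. How that error translates into task accuracy depends on the model and the task, which is what the evaluation summarized above measures.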

Problem

Research questions and friction points this paper is trying to address.

Evaluating quantization impact on model performance across sizes

Comparing AWQ and GPTQ effectiveness in weight-only quantization, and FP8 robustness across tasks

Assessing accuracy drops in small vs large quantized models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates four quantization methods on 13 datasets

FP8 is most robust across various tasks

70B models maintain stability at 4-bit quantization
