🤖 AI Summary
This study systematically investigates the model size–accuracy trade-off in 1- to 4-bit quantization of large language models (LLMs). We propose the first unified evaluation framework enabling strict, apples-to-apples comparisons across 1-bit to 4-bit quantization. Our analysis reveals a learning phase transition in the 2–3-bit regime, a previously unreported phenomenon. To address ultra-low-bit challenges, we design an adaptive quantization function and dedicated training strategies tailored for ≤2-bit quantization, and employ a Pareto-frontier search to automatically identify the best quantization configuration. Experiments demonstrate that ternary (≈1.58-bit), 2-bit, and 3-bit quantization schemes dominate both binary and 4-bit alternatives in the accuracy–model-size trade-off. Notably, our ternary 600M-parameter model surpasses the previous state-of-the-art ternary 3B-parameter model in accuracy while using only one-fifth of the parameters. These advances significantly improve hardware deployment efficiency and memory bandwidth utilization.
📝 Abstract
The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework covering different bit widths has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: for 3 bits and above, fine-tuned models stay close to their original pre-trained distributions, whereas for 2-bit networks and below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintain comparable performance in the size-accuracy trade-off and generally exceed 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
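To make the ternary (1.58-bit) setting concrete, here is a minimal sketch of one common ternary quantization scheme: scale weights by their mean absolute value, round, and clip to {-1, 0, +1}. This is an illustrative simplification, not ParetoQ's adaptive quantization function, whose details are given in the paper.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor scale.

    Illustrative ternary scheme (not the ParetoQ function): divide by
    the mean absolute weight, round to the nearest integer, and clip
    to the ternary set. Dequantize as q * scale.
    """
    scale = np.mean(np.abs(w)) + 1e-8          # per-tensor scale factor
    q = np.clip(np.round(w / scale), -1, 1)    # values in {-1, 0, +1}
    return q, scale

# Example: each weight maps to one of three levels.
w = np.array([0.9, -0.05, 0.4, -1.2])
q, s = ternary_quantize(w)   # q = [1, 0, 1, -1], s ≈ 0.6375
```

Each weight then costs log2(3) ≈ 1.58 bits of storage, which is where the "1.58-bit" figure comes from; during quantization-aware training, the rounding step is typically bypassed on the backward pass with a straight-through estimator.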