Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs

📅 2025-06-10
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
To address the challenge of deploying instruction-tuned large language models (LLMs) on resource-constrained devices, this paper proposes UPQ, an end-to-end 2-bit integer quantization framework that quantizes progressively from FP16 to INT4 to INT2. Methodologically, UPQ unifies block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (QAT) driven by the generalized Jensen–Shannon divergence, without requiring proprietary post-training data. The key contributions are threefold: (1) the first demonstrated 2-bit quantization of open-source instruction-tuned LLMs without proprietary data; (2) state-of-the-art 2-bit performance on MMLU and IFEval, surpassing comparable 4-bit models in accuracy; and (3) substantial reductions in memory footprint and inference latency. UPQ thus enables efficient, high-fidelity deployment of instruction-tuned LLMs on edge devices while preserving instruction-following capability.
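
The block-wise PTQ step mentioned above reconstructs each transformer block's output after its weights are quantized. Below is a minimal PyTorch sketch of this idea, assuming per-tensor weight scales, a straight-through estimator for rounding, and MSE reconstruction on calibration activations; `FakeQuantLinear` and `reconstruct_block` are illustrative names, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinear(nn.Module):
    """Linear layer whose weights are uniformly fake-quantized to `bits` bits,
    with a learnable per-tensor scale and straight-through rounding."""
    def __init__(self, linear: nn.Linear, bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bias = (nn.Parameter(linear.bias.detach().clone())
                     if linear.bias is not None else None)
        self.qmax = 2 ** (bits - 1) - 1
        self.scale = nn.Parameter(self.weight.abs().max() / self.qmax)

    def forward(self, x):
        w = self.weight / self.scale
        w = w + (torch.round(w) - w).detach()            # STE: round fwd, identity bwd
        w = torch.clamp(w, -self.qmax - 1, self.qmax) * self.scale
        return F.linear(x, w, self.bias)

def reconstruct_block(fp_block, q_block, calib_x, steps=200, lr=1e-3):
    """Block-wise PTQ: tune one quantized block so its output matches the
    full-precision block's output on calibration activations."""
    with torch.no_grad():
        target = fp_block(calib_x)                       # FP16 reference output
    opt = torch.optim.Adam(q_block.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(q_block(calib_x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_block
```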

📝 Abstract
As the rapid scaling of large language models (LLMs) poses significant challenges for deployment on resource-constrained devices, there is growing interest in extremely low-bit quantization, such as 2-bit. Although prior works have shown that 2-bit large models are Pareto-optimal over their 4-bit smaller counterparts in both accuracy and latency, these advancements have been limited to pre-trained LLMs and have not yet been extended to instruction-tuned models. To bridge this gap, we propose Unified Progressive Quantization (UPQ), a novel progressive quantization framework (FP16$\rightarrow$INT4$\rightarrow$INT2) that unifies block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for INT2 instruction-tuned LLM quantization. UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ to significantly reduce the quantization error introduced by subsequent INT2 quantization. Next, UPQ applies Distill-QAT to enable INT2 instruction-tuned LLMs to generate responses consistent with their original FP16 counterparts by minimizing the generalized Jensen-Shannon divergence (JSD) between the two. To the best of our knowledge, we are the first to demonstrate that UPQ can quantize open-source instruction-tuned LLMs to INT2 without relying on proprietary post-training data, while achieving state-of-the-art performance on MMLU and IFEval, two of the most representative benchmarks for evaluating instruction-tuned LLMs.
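
The generalized JSD in the abstract interpolates two KL terms against a mixture distribution. One standard parameterization (assumed here, with interpolation weight $\beta \in (0,1)$; the paper's exact weighting may differ) is

$$
\mathrm{JSD}_{\beta}(P \,\|\, Q) = \beta\,\mathrm{KL}(P \,\|\, M) + (1 - \beta)\,\mathrm{KL}(Q \,\|\, M), \qquad M = \beta P + (1 - \beta) Q,
$$

where $P$ and $Q$ are the FP16 teacher's and INT2 student's next-token distributions; $\beta = 1/2$ recovers the standard JSD.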
Problem

Research questions and friction points this paper is trying to address.

Quantize instruction-tuned LLMs to 2-bit efficiently
Unify block-wise PTQ and Distill-QAT for progressive quantization
Achieve state-of-the-art performance without proprietary data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Progressive Quantization (UPQ) framework
Combines block-wise PTQ and Distill-QAT (JSD loss sketched after this list)
Quantizes instruction-tuned LLMs to 2-bit
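
The Distill-QAT objective above can be written directly in PyTorch. A minimal sketch, assuming token-level logits and an interpolation weight `beta` (function name and averaging scheme are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def generalized_jsd(student_logits, teacher_logits, beta=0.5, eps=1e-12):
    """Generalized Jensen-Shannon divergence between the INT2 student's and
    FP16 teacher's next-token distributions; beta = 0.5 recovers standard JSD."""
    p = F.softmax(teacher_logits.detach(), dim=-1)       # teacher P (no grad)
    q = F.softmax(student_logits, dim=-1)                # student Q
    log_m = torch.log(beta * p + (1 - beta) * q + eps)   # mixture M, in log space
    kl_pm = F.kl_div(log_m, p, reduction="batchmean")    # KL(P || M)
    kl_qm = F.kl_div(log_m, q, reduction="batchmean")    # KL(Q || M)
    return beta * kl_pm + (1 - beta) * kl_qm

# Usage sketch inside a Distill-QAT training step (teacher kept frozen):
#   loss = generalized_jsd(int2_model(ids).logits, fp16_model(ids).logits)
#   loss.backward(); optimizer.step()
```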
Jung Hyun Lee (Qualcomm AI Research)
Seungjae Shin (Qualcomm AI Research)
Vinnam Kim (Qualcomm AI Research)
Jaeseong You (Qualcomm AI Research)
An Chen (Qualcomm AI Research)

Machine Learning · Model Compression · Model Quantization