🤖 AI Summary
To address the high storage and computational overhead of deploying large language models (LLMs), this paper proposes a staged mixed-precision post-training quantization (PTQ) method that achieves 10× model compression at an average weight bitwidth of 1.6 bits (80% of weights at 1 bit, 20% at 4 bits). It introduces two mechanisms, Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS), which jointly mitigate error propagation under ultra-low-bit quantization and improve activation robustness. Evaluated on the LLaMA family, the method raises average zero-shot classification accuracy across six benchmarks from 43% to 56%, establishing a new state of the art for sub-2-bit weight-only quantization and striking a strong balance between extreme compression and preserved inference accuracy.
📝 Abstract
Deploying large language models (LLMs) is challenging due to their massive parameter counts and high computational costs. Ultra-low-bit quantization can significantly reduce storage and accelerate inference, but extreme compression (i.e., average bit-width ≤ 2) often causes severe performance degradation. To address this, we propose Squeeze10-LLM, which effectively "squeezes" 16-bit LLM weights by 10 times. Specifically, Squeeze10-LLM is a staged mixed-precision post-training quantization (PTQ) framework that achieves an average of 1.6 bits per weight by quantizing 80% of the weights to 1 bit and 20% to 4 bits. It incorporates two key innovations: Post-Binarization Activation Robustness (PBAR) and Full Information Activation Supervision (FIAS). PBAR is a refined weight-significance metric that accounts for the impact of quantization on activations, improving accuracy in low-bit settings. FIAS is a strategy that preserves full activation information during quantization to mitigate cumulative error propagation across layers. Experiments on LLaMA and LLaMA2 show that Squeeze10-LLM achieves state-of-the-art performance for sub-2-bit weight-only quantization, improving average accuracy from 43% to 56% on six zero-shot classification tasks, a significant boost over existing PTQ methods. Our code will be released upon publication.
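The headline numbers follow directly from the bit allocation: 0.8 × 1 + 0.2 × 4 = 1.6 bits on average, and 16 / 1.6 = 10× compression relative to FP16. The sketch below illustrates this arithmetic with a toy mixed-precision split; the magnitude-based salience proxy used to pick the 4-bit group is an assumption for illustration only, not the paper's PBAR metric.

```python
import numpy as np

# Toy demonstration of the 80/20 mixed-precision split.
# NOTE: ranking weights by |w| is an illustrative proxy, not the
# paper's PBAR significance metric.
rng = np.random.default_rng(0)
weights = rng.standard_normal(1000)

frac_4bit = 0.20
# Keep the top 20% most "salient" weights at 4 bits, binarize the rest.
threshold = np.quantile(np.abs(weights), 1 - frac_4bit)
bits = np.where(np.abs(weights) >= threshold, 4, 1)

avg_bits = bits.mean()           # 0.8 * 1 + 0.2 * 4 = 1.6
compression = 16.0 / avg_bits    # vs. FP16 storage
print(f"avg bits: {avg_bits:.2f}, compression: {compression:.1f}x")
```

Any other split with the same average (e.g. 90% at 1 bit and 10% at 7 bits) would also reach 1.6 bits, so the 80/20 choice reflects a trade-off between how many weights can tolerate binarization and how much precision the remaining salient weights need.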