Binary Quantization For LLMs Through Dynamic Grouping

📅 2025-09-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high inference resource consumption of large language models (LLMs) and the severe performance degradation inherent in 1-bit binarization, this paper proposes a dynamic grouped binarization quantization method. The approach introduces: (1) a novel optimization objective specifically designed for binarization; (2) dynamic unstructured submatrix partitioning coupled with adaptive grouping to mitigate weight distribution heterogeneity; and (3) a three-stage optimization algorithm that combines block-wise quantization with embarrassingly parallel single-core processing. Evaluated on LLaMA-3.2-3B, the method achieves an average bit-width of 1.007 bits—effectively near-binary compression—with a perplexity of 8.23 (vs. 7.81 for the full-precision baseline), significantly outperforming existing state-of-the-art binarization techniques. The entire quantization process completes in under 100 minutes.
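The paper's exact objective is not reproduced here, but the starting point of any 1-bit scheme is approximating a weight matrix W by alpha * B with codes B in {-1, +1}. A minimal NumPy sketch (not the paper's implementation) of this baseline, using the fact that for B = sign(W) the least-squares-optimal scale is the mean absolute weight:

```python
import numpy as np

def binarize(W):
    """Baseline 1-bit binarization: W ≈ alpha * B with B in {-1, +1}.

    For fixed B = sign(W), the per-row scale alpha = mean(|W|) minimizes
    the Frobenius error ||W - alpha * B||_F over that row.
    """
    B = np.where(W >= 0, 1.0, -1.0)                 # {-1, +1} codes
    alpha = np.abs(W).mean(axis=1, keepdims=True)   # per-row optimal scale
    return alpha, B

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))        # toy stand-in for an LLM weight block
alpha, B = binarize(W)
err = np.linalg.norm(W - alpha * B)
```

The grouped method in the paper refines this by letting different unstructured sub-matrices carry their own scales, at a small bit-width overhead (hence the 1.007-bit average).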

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks, but require substantial memory and computational resources. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. However, such aggressive quantization often leads to notable performance degradation compared to more conservative 4-bit quantization methods. In this research, we propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively. Our method enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. Experimental results demonstrate that our approach achieves an average bit length of just 1.007 bits, while maintaining high model quality. Specifically, our quantized LLaMA 3.2 3B model attains a perplexity of 8.23, remarkably close to the original 7.81, and far surpasses the previous SOTA BiLLM, which reaches a perplexity of 123.90. Furthermore, our method is competitive with SOTA 4-bit approaches such as GPTQ in both performance and efficiency. The compression process is highly efficient, requiring only 14 seconds to quantize the full LLaMA 3.2 3B weights on a single CPU core, with the entire process completing in under 100 minutes and exhibiting embarrassingly parallel properties. Code: https://github.com/johnnyzheng0636/WGM_bi_quan
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM memory and computational costs via binary quantization
Minimizing performance degradation in 1-bit weight compression
Optimizing dynamic grouping strategies for efficient model quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic grouping for optimal unstructured sub-matrix identification
Novel optimization objective tailored for binary quantization
Highly parallel, low-latency compression process
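To illustrate why grouping helps, the sketch below partitions weights into magnitude-quantile groups and fits one scale per group. The quantile criterion is a hypothetical stand-in for the paper's dynamic unstructured partitioning, but the error argument carries over: since each group's own least-squares scale can only reduce that group's error relative to a shared global scale, the grouped approximation is never worse overall.

```python
import numpy as np

def binarize_grouped(W, n_groups=4):
    """Illustrative grouped binarization (quantile grouping is an
    assumption, not the paper's partitioning rule): assign each weight
    to a group by magnitude quantile and fit one scale per group."""
    B = np.where(W >= 0, 1.0, -1.0)
    # Group boundaries at magnitude quantiles of all weights.
    edges = np.quantile(np.abs(W), np.linspace(0, 1, n_groups + 1)[1:-1])
    group = np.digitize(np.abs(W), edges)     # group id per weight
    alpha = np.zeros_like(W)
    for g in range(n_groups):
        mask = group == g
        if mask.any():
            alpha[mask] = np.abs(W[mask]).mean()  # per-group optimal scale
    return alpha * B

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))
err_single = np.linalg.norm(W - np.abs(W).mean() * np.sign(W))  # one scale
err_grouped = np.linalg.norm(W - binarize_grouped(W))           # per-group scales
```

Per-group scales add a small amount of side information, which is why the paper reports an average of 1.007 rather than exactly 1.0 bits per weight.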
👥 Authors
Xinzhe Zheng, National University of Singapore (AI for Biomedicine; AI for Science)
Zhen-Qun Yang, Department of Computing, The Hong Kong Polytechnic University, Hong Kong
Haoran Xie, School of Data Science, Lingnan University, Hong Kong
S. Joe Qin, Lingnan University, Hong Kong (President; Member of EASA; Fellow of HKAE, NAI, IEEE, IFAC, AIChE; process data analytics, data science, process control, system identification, process monitoring)
Arlene Chen, Xiaoi Robot Inc., Shanghai, China
Fangzhen Lin, unknown affiliation