BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Achieving high-fidelity joint 4-bit quantization of weights and activations (W4A4) for large language models (LLMs) remains challenging under post-training quantization (PTQ), i.e., without quantization-aware training (QAT). Method: the paper proposes Block Clustered Quantization (BCQ), which partitions each operand tensor into contiguous blocks, clusters the blocks by their statistics, and learns a locally optimal 4-bit symmetric codebook per cluster. It further introduces Locally-Optimal BCQ (LO-BCQ), an iterative algorithm combining block decomposition, K-means clustering, mean-squared-error-driven alternating optimization, and lightweight scaling-factor encoding. Results: evaluated across several mainstream LLMs and downstream tasks, BCQ achieves <1% accuracy degradation under W4A4 quantization while incurring only 0.5 bits of overhead per scalar, setting a new state of the art for PTQ-based joint 4-bit quantization.
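
A minimal sketch of the alternating loop described above, assuming NumPy, a block size of 16, and 8 clusters (none of these hyperparameters are stated on this page); the function name `lo_bcq_sketch` is hypothetical, and for brevity the refit step does not enforce the paper's symmetry constraint on codebooks:

```python
import numpy as np

def lo_bcq_sketch(x, block_size=16, n_clusters=8, codebook_size=16,
                  n_iters=10):
    """Toy LO-BCQ-style loop: alternate (1) assigning each block to the
    cluster codebook that minimizes its quantization MSE and (2) refitting
    each codebook to the blocks assigned to it (a Lloyd/k-means update).
    Assumes x.size is divisible by block_size; all hyperparameters here
    are illustrative guesses, not the paper's exact recipe."""
    blocks = x.reshape(-1, block_size)                  # contiguous blocks
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    normed = blocks / scales                            # per-block scaling

    # Initialize K symmetric codebooks as uniform grids of varying range.
    half = codebook_size // 2
    codebooks = []
    for r in np.linspace(0.5, 1.0, n_clusters):
        pos = np.linspace(r / half, r, half)            # positive levels
        codebooks.append(np.concatenate([-pos[::-1], pos]))
    codebooks = np.stack(codebooks)                     # (K, codebook_size)

    for _ in range(n_iters):
        # (1) Assignment: per-block quantization MSE against every codebook.
        err = np.stack([
            ((normed[:, :, None] - cb[None, None, :]) ** 2).min(-1).sum(-1)
            for cb in codebooks
        ])                                              # (K, n_blocks)
        assign = err.argmin(axis=0)
        # (2) Codebook refit: move each level to the mean of the scalars
        # it currently quantizes (symmetry not enforced, for brevity).
        for k in range(n_clusters):
            vals = normed[assign == k].ravel()
            if vals.size == 0:
                continue
            idx = np.abs(vals[:, None] - codebooks[k][None, :]).argmin(1)
            for j in range(codebook_size):
                if np.any(idx == j):
                    codebooks[k][j] = vals[idx == j].mean()
            codebooks[k].sort()

    # Dequantize: snap each scalar to its cluster's codebook, undo scaling.
    deq = np.stack([
        cb[np.abs(nb[:, None] - cb[None, :]).argmin(1)]
        for nb, cb in zip(normed, codebooks[assign])
    ])
    return (deq * scales).reshape(x.shape)
```

As with k-means, each of the two steps can only decrease (or preserve) the total quantization MSE, which is what makes the greedy alternation converge to a locally optimal set of codebooks.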

📝 Abstract
Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-8-bits while maintaining activations at 8-bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ) wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded to W4A4 format (with 0.5 bits of overhead for storing scaling factors and codebook selectors), we advance the current state-of-the-art by demonstrating <1% loss in inference accuracy across several LLMs and downstream tasks.
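
The quoted 0.5-bit overhead is easy to sanity-check with back-of-the-envelope arithmetic; the block size and bit widths below are illustrative assumptions, not values taken from this page:

```python
# Hypothetical encoding: one shared scale and one codebook selector
# per block of 16 scalars (these parameters are assumptions).
block_size = 16
scale_bits = 4      # e.g., a low-bit shared scale per block
selector_bits = 4   # index selecting one of up to 16 cluster codebooks

overhead = (scale_bits + selector_bits) / block_size
print(overhead)     # 0.5 bits per scalar on top of the 4-bit payload
```
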
Problem

Research questions and friction points this paper is trying to address.

4-bit quantization
large language models
post-training quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block Clustered Quantization method
Post-training quantization algorithm
W4A4 format optimization