AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) faces dual bottlenecks of memory footprint and latency, necessitating hardware-efficient multi-precision quantization with dynamic precision switching. This paper proposes AnyBCQ, a hardware-aware multi-precision quantization framework built on bit-plane representation and Binary-Coded Quantization (BCQ). Its core innovation is a progressive precision expansion mechanism: binary bit-planes are reused across precision levels while per-group scaling factors are incrementally optimized, enabling request-level dynamic precision selection. A co-designed bit-parallel compute kernel keeps the hardware overhead of this flexibility negligible. Experiments demonstrate that AnyBCQ substantially mitigates accuracy degradation at ultra-low bit-widths (e.g., 2-bit) and remains competitive with state-of-the-art methods at higher precisions, while achieving up to 3.0× higher throughput than FP16 and 1.2× over the SOTA.
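To make the bit-plane idea concrete, here is a minimal sketch of the classic greedy BCQ scheme the summary builds on: a weight group is approximated as a sum of per-plane scale factors times ±1 bit-planes, and lower precisions simply sum fewer planes. This is an illustration of standard BCQ, not the paper's exact optimization; the function names are invented for this sketch.

```python
import numpy as np

def bcq_encode(w, bits):
    """Greedy BCQ: approximate w as sum_b alpha_b * c_b with c_b in {-1, +1}^n."""
    residual = w.astype(np.float64).copy()
    planes, scales = [], []
    for _ in range(bits):
        c = np.where(residual >= 0, 1.0, -1.0)   # binary bit-plane
        alpha = np.abs(residual).mean()          # per-group scale factor
        planes.append(c)
        scales.append(alpha)
        residual -= alpha * c                    # refine the remaining error
    return np.array(scales), np.array(planes)

def bcq_decode(scales, planes, precision):
    """Reconstruct using only the first `precision` bit-planes."""
    return (scales[:precision, None] * planes[:precision]).sum(axis=0)

w = np.array([0.8, -0.3, 0.05, -0.6])
scales, planes = bcq_encode(w, bits=3)
# Reconstruction error shrinks as more planes are enabled.
errs = [np.linalg.norm(w - bcq_decode(scales, planes, k)) for k in (1, 2, 3)]
```

Because each greedy step subtracts the best uniform-scale sign approximation of the residual, the L2 error is guaranteed to decrease with every additional plane, which is what makes a single stored representation usable at several precisions.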

📝 Abstract
The deployment of large language models (LLMs) is increasingly constrained by memory and latency bottlenecks, motivating the need for quantization techniques that flexibly balance accuracy and efficiency. Recent work has introduced multi-precision models, which enable inference at multiple precisions within a single model depending on runtime constraints. To support such flexibility, quantized weights are often stored as bit-planes, where hardware efficiency improves when the compute operates directly at the bit-plane level and activates only the precision required by each request. In this work, we present AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded Quantization (BCQ) that supports direct bit-plane operations. By representing weights as binary bit-planes with corresponding scale factors, AnyBCQ enables bit-plane-level computation and maps naturally to accelerator-friendly, bit-parallel arithmetic. Our progressive precision expansion mechanism incrementally refines scaling factors while reusing previously assigned binary codes, yielding monotonic improvements in accuracy as additional bits are enabled. We further co-design a specialized kernel that exploits the BCQ structure to support dynamic per-request precision selection with negligible overhead. Experiments on recent LLMs demonstrate that AnyBCQ significantly narrows the accuracy drop in the low-bit regime (e.g. 2-bit), remains competitive at higher precision, and achieves throughput gains of up to 3.0x over half precision and 1.2x over state-of-the-art multi-precision methods. By aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a practical foundation for multi-precision LLM deployment across diverse service-level objectives.
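One way to read "incrementally refines scaling factors while reusing previously assigned binary codes" is as a least-squares refit of the scales for each precision level over a fixed set of bit-planes. The sketch below illustrates why this yields monotonic accuracy improvements; the refit procedure here is an assumption for illustration, not the paper's stated algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 64, 4
w = rng.standard_normal(n)                      # one weight group
codes = rng.choice([-1.0, 1.0], size=(B, n))    # fixed binary bit-planes

def best_scales(codes_k, w):
    """Least-squares scales for a fixed set of binary codes (hypothetical refit)."""
    s, *_ = np.linalg.lstsq(codes_k.T, w, rcond=None)
    return s

errors = []
for k in range(1, B + 1):
    s = best_scales(codes[:k], w)               # reuse the first k codes
    errors.append(np.linalg.norm(w - s @ codes[:k]))
# Enabling an extra plane only enlarges the least-squares search space,
# so the best-fit error is non-increasing in k.
```

Since the codes for the first k planes never change, the k-plane solution is always available inside the (k+1)-plane problem, which is the structural reason accuracy can only improve as bits are enabled.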
Problem

Research questions and friction points this paper is trying to address.

Addresses memory and latency bottlenecks in LLM deployment
Enables flexible multi-precision inference within single models
Supports hardware-efficient bit-plane operations for quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binary bit-plane representation for flexible precision
Progressive scaling mechanism reusing binary codes
Specialized kernel enabling dynamic precision selection
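The bullets above can be sketched as a bit-plane matrix-vector product in which each request activates only as many planes as its precision budget allows. This is a NumPy stand-in for the bit-parallel kernel, with invented names and random data; a real kernel would operate on packed bits with popcount-style arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: weights stored as B binary planes plus per-plane
# scales, so a request can activate only the planes it needs.
B, out_dim, in_dim = 4, 8, 16
planes = rng.choice([-1.0, 1.0], size=(B, out_dim, in_dim))
scales = np.sort(rng.random(B))[::-1]           # coarse-to-fine scale factors
x = rng.standard_normal(in_dim)

def bitplane_matvec(x, planes, scales, precision):
    """y = sum over the first `precision` planes of scale_b * (C_b @ x)."""
    y = np.zeros(planes.shape[1])
    for c, a in zip(planes[:precision], scales[:precision]):
        y += a * (c @ x)                        # each plane is a +/-1 matmul
    return y

y2 = bitplane_matvec(x, planes, scales, precision=2)   # latency-sensitive request
y4 = bitplane_matvec(x, planes, scales, precision=4)   # accuracy-sensitive request
```

Because precision selection is just a loop bound over shared bit-planes, switching precision per request costs nothing beyond the skipped plane computations, which is the property the co-designed kernel exploits.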