BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing binarization methods for large language models: they cannot handle the heavy-tailed distribution of activations, so activations must stay in high precision and end-to-end low-bit inference remains out of reach. The authors propose BWLA (Binarized Weights and Low-bit Activations), which they describe as the first post-training quantization framework to combine 1-bit weights with low-bit activations (e.g., 6 bits) while preserving high accuracy. BWLA introduces an Orthogonal-Kronecker Transformation (OKT) and a Proximal SVD Projection (PSP) to mitigate the effect of activation outliers. On Qwen3-32B with 6-bit activations, the method reaches a Wikitext2 perplexity of 11.92 (versus 38 for the prior state of the art), improves results on five zero-shot tasks by more than 70%, and delivers a 3.26× inference speedup.
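To make the W1A6 setting concrete, here is a minimal NumPy sketch of what 1-bit weights and 6-bit activations mean numerically. It is an illustration under assumptions, not BWLA itself: the per-row mean-absolute scale, the per-token symmetric quantizer, and the names `binarize_weights` and `quantize_activations` are all choices made for this example, and it includes neither OKT nor PSP. A single injected outlier shows why heavy tails hurt: it inflates the quantization step for its whole row.

```python
import numpy as np

def binarize_weights(w):
    """1-bit weights: sign(w) times a per-row scale (assumed: mean |w|)."""
    alpha = np.abs(w).mean(axis=1, keepdims=True)  # one scale per output row
    b = np.where(w >= 0, 1.0, -1.0)                # entries in {-1, +1}
    return alpha * b

def quantize_activations(x, bits=6):
    """Symmetric uniform fake-quantization with one scale per token (row)."""
    qmax = 2 ** (bits - 1) - 1                     # 31 levels each side for 6 bits
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard against all-zero rows
    q = np.clip(np.round(x / scale), -qmax, qmax)  # integer grid
    return q * scale                               # back to floating point

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))   # (out_features, in_features)
x = rng.standard_normal((4, 16))   # (tokens, in_features)
x[0, 3] = 40.0                     # synthetic outlier in the first token

w_q = binarize_weights(w)
x_q = quantize_activations(x, bits=6)

# Per-token quantization error: the outlier row has a much coarser step.
row_err = np.linalg.norm(x_q - x, axis=1)
print("per-token activation error:", row_err)

# Relative error of the quantized layer output versus full precision.
y_ref = x @ w.T
y_q = x_q @ w_q.T
print("relative output error:", np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref))
```

The first token's error is dominated by the outlier stretching its scale, which is the kind of effect that activation-side transforms aim to suppress.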

📝 Abstract

Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandwidth cost. However, existing methods cannot address activation heavy tails and thus must keep activations in high precision, preventing true end-to-end acceleration. To overcome this limitation, we propose BWLA (Binarized Weights and Low-bit Activations), the first post-training quantization framework that preserves high accuracy while achieving 1-bit weight quantization together with low-bit activations (e.g., 6 bits). The Orthogonal-Kronecker Transformation (OKT) learns an orthogonal mapping via EM minimization, converting unimodal weights into symmetric bimodal forms while suppressing activation tails and incoherence. The Proximal SVD Projection (PSP) then performs lightweight low-rank refinement, further enhancing quantizability with minimal overhead. On Qwen3-32B, BWLA reaches a Wikitext2 perplexity of 11.92 under 6-bit activations (vs. 38 for the prior SOTA), improves five zero-shot tasks by more than 70%, and delivers a 3.26× inference speedup, demonstrating strong potential for real-world LLM compression and acceleration.
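The property that makes an orthogonal transform usable in this setting can be checked in a few lines. The sketch below is an illustration under stated assumptions, not the learned OKT: it uses random orthogonal factors (the paper learns the mapping), hypothetical factor sizes `d1` and `d2`, and a synthetic outlier channel. It verifies that a Kronecker product of orthogonal matrices is itself orthogonal and that rotating both activations and weights leaves the layer output unchanged, then reports the peak-to-RMS ratio of the activations before and after rotation.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def peak_to_rms(a):
    """Per-row ratio of the largest magnitude to the RMS value."""
    return np.abs(a).max(axis=1) / np.sqrt((a ** 2).mean(axis=1))

rng = np.random.default_rng(0)
d1, d2 = 4, 8                  # hypothetical factor sizes; hidden dim d = d1 * d2
d = d1 * d2

# Kronecker product of two small orthogonal matrices: a d x d orthogonal
# matrix described by d1*d1 + d2*d2 parameters instead of d*d.
q = np.kron(random_orthogonal(d1, rng), random_orthogonal(d2, rng))

w = rng.standard_normal((16, d))   # (out_features, in_features)
x = rng.standard_normal((5, d))    # (tokens, in_features)
x[:, 7] *= 50.0                    # synthetic outlier channel

x_rot = x @ q                      # rotated activations
w_rot = w @ q                      # rotated weights

# (x q)(w q)^T = x q q^T w^T = x w^T, so the layer output is unchanged.
y_ref = x @ w.T
y_rot = x_rot @ w_rot.T
print("max output difference:", np.abs(y_rot - y_ref).max())

print("peak-to-RMS before rotation:", peak_to_rms(x))
print("peak-to-RMS after rotation: ", peak_to_rms(x_rot))
```

Because the output is preserved up to floating-point error, any benefit from such a rotation comes from the rotated tensors being easier to quantize, not from changing the function the layer computes.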

Problem

Research questions and friction points this paper is trying to address.

post-training quantization

large language models

1-bit weights

activation quantization

model compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-Training Quantization

Binarized Weights

Low-bit Activations