AI Summary
This work addresses a limitation of existing binarization methods for large language models: they cannot handle the heavy-tailed distribution of activations, so activations must stay in high precision, which blocks efficient end-to-end low-bit inference. The authors propose BWLA, described as the first post-training quantization framework to preserve high accuracy while jointly quantizing weights to 1 bit and activations to low-bit precision (e.g., 6 bits). BWLA introduces an Orthogonal-Kronecker Transformation (OKT) and a Proximal SVD Projection (PSP) to mitigate the adverse effects of activation outliers. On Qwen3-32B with 6-bit activations, the method reaches a Wikitext2 perplexity of 11.92 (versus 38 for the prior state of the art), improves five zero-shot tasks by more than 70%, and delivers a 3.26× inference speedup.
Abstract
Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandwidth cost. However, existing methods cannot address the heavy tails of activation distributions and thus must keep activations in high precision, preventing true end-to-end acceleration. To overcome this limitation, we propose BWLA (Binarized Weights and Low-bit Activations), the first post-training quantization framework that preserves high accuracy while achieving 1-bit weight quantization together with low-bit activations (e.g., 6 bits). The Orthogonal-Kronecker Transformation (OKT) learns an orthogonal mapping via EM minimization, converting unimodal weights into symmetric bimodal forms while suppressing activation tails and incoherence. The Proximal SVD Projection (PSP) then performs a lightweight low-rank refinement, further enhancing quantizability with minimal overhead. On Qwen3-32B, BWLA reaches a Wikitext2 perplexity of 11.92 under 6-bit activations (vs. 38 for the prior state of the art), improves five zero-shot tasks by more than 70%, and delivers a 3.26× inference speedup, demonstrating strong potential for real-world LLM compression and acceleration.
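The ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not BWLA's OKT or PSP: the orthogonal matrix here is random rather than learned, no low-rank refinement is applied, and the helper names `random_orthogonal`, `binarize_rows`, and `quantize_symmetric` are hypothetical. It only shows, under those assumptions, that a Kronecker product of orthogonal factors is itself orthogonal, that rotating activations and weights by the same orthogonal matrix leaves the layer output unchanged before quantization, and what 1-bit weights with 6-bit activations look like in the simplest sign-and-scale form.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def binarize_rows(w):
    """Plain sign binarization with one scale per output row: w ~ alpha * sign(w)."""
    alpha = np.abs(w).mean(axis=1, keepdims=True)  # least-squares scale for a sign code
    signs = np.where(w >= 0, 1.0, -1.0)
    return alpha * signs

def quantize_symmetric(x, bits=6):
    """Per-row symmetric uniform quantization to `bits` bits, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard against all-zero rows
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)

# Assumption: a random rotation stands in for the learned transform.
# The Kronecker product of a 4x4 and an 8x8 orthogonal factor is a 32x32 orthogonal matrix.
Q = np.kron(random_orthogonal(4, rng), random_orthogonal(8, rng))

W = rng.standard_normal((16, 32))  # toy weights, shape (out_features, in_features)
x = rng.standard_normal((5, 32))   # toy activations, shape (tokens, in_features)
x[:, 3] *= 50.0                    # one artificial outlier channel (heavy tail)

y_ref = x @ W.T                    # full-precision layer output
y_rot = (x @ Q) @ (W @ Q).T        # same output in exact arithmetic, since Q Q^T = I

# 1-bit weights and 6-bit activations, both quantized in the rotated basis.
y_q = quantize_symmetric(x @ Q, bits=6) @ binarize_rows(W @ Q).T
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative output error with 1-bit weights, 6-bit activations: {rel_err:.3f}")
```

In this toy setup the Kronecker structure means only the 4×4 and 8×8 factors (80 numbers) need to be stored instead of the dense 32×32 matrix (1024 numbers), which is one common reason for choosing such a factorization.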