RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs

📅 2026-02-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Under extreme quantization, large language models suffer performance degradation due to feature redundancy and co-adaptation among residual binary paths. To address this, the authors propose RaBiT, a framework that introduces a sequential derivation mechanism during quantization-aware training: each binary path is derived in turn from shared full-precision weights, progressively correcting its predecessors' errors and explicitly constructing a hierarchical residual structure. This eliminates heuristic path-freezing strategies and pairs residual-aware binarization with a robust initialization scheme that prioritizes functional preservation, substantially improving model expressiveness and quantization stability. Experiments show that RaBiT achieves state-of-the-art performance at 2-bit precision, matching the accuracy of high-overhead vector quantization methods while delivering 4.49× faster inference than full-precision models on an RTX 4090 GPU.
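For intuition, here is a minimal sketch of greedy residual binarization with sequential derivation (not the authors' code; `residual_binarize`, `reconstruct`, and the mean-absolute per-tensor scale are illustrative assumptions). Each binary path binarizes whatever error the preceding paths left behind, so the paths form an explicit hierarchy rather than co-adapted parallel approximations:

```python
import numpy as np

def residual_binarize(w_fp, num_paths=2):
    """Derive each binary (+/-1) path from the residual error left by
    all preceding paths, forming an explicit error-correcting hierarchy.
    Illustrative sketch, not the authors' implementation."""
    residual = w_fp.copy()
    paths = []
    for _ in range(num_paths):
        sign = np.sign(residual)
        sign[sign == 0] = 1                 # sign() maps 0 -> 0; force a valid +/-1 code
        alpha = np.abs(residual).mean()     # closed-form per-tensor scale for this residual
        paths.append((alpha, sign))
        residual = residual - alpha * sign  # the next path corrects what is left
    return paths

def reconstruct(paths):
    """Sum of scaled binary paths approximates the full-precision weight."""
    return sum(alpha * sign for alpha, sign in paths)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
paths = residual_binarize(w, num_paths=2)   # two binary paths ~ 2-bit weights
print("relative error:", np.linalg.norm(w - reconstruct(paths)) / np.linalg.norm(w))
```

In RaBiT the derivation happens inside quantization-aware training, with every path re-derived from the shared full-precision weights; the sketch above shows only the one-shot, post-hoc version of the decomposition.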

📝 Abstract
Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary ($\pm$1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, which ensures that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49\times$ inference speed-up over full-precision models on an RTX 4090.
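The abstract's "robust initialization that prioritizes functional preservation over mere weight approximation" can be read as calibrating the decomposition against layer outputs rather than raw weights. A hedged sketch of that reading, assuming greedy residual sign patterns plus a least-squares refit of the per-path scales on calibration activations `X` (the closed form and all names here are my assumptions, not the paper's procedure):

```python
import numpy as np

def functional_init(w_fp, X, num_paths=2):
    """Hypothetical output-preserving initialization: binary sign patterns
    come from greedy residual binarization, but the scalar scales are
    refit by least squares against the layer outputs X @ W rather than
    against the weights themselves. Not the paper's exact procedure."""
    residual, signs = w_fp.copy(), []
    for _ in range(num_paths):
        s = np.sign(residual)
        s[s == 0] = 1                       # force valid +/-1 codes
        signs.append(s)
        residual = residual - np.abs(residual).mean() * s
    # Refit scales to preserve the function (layer outputs), not the weights.
    Y = (X @ w_fp).ravel()
    A = np.stack([(X @ s).ravel() for s in signs], axis=1)
    alphas, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return list(zip(alphas, signs))

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))
X = rng.normal(size=(64, 16))               # stand-in calibration activations
paths = functional_init(W, X, num_paths=2)
W_hat = sum(a * s for a, s in paths)
print("output error:", np.linalg.norm(X @ W - X @ W_hat) / np.linalg.norm(X @ W))
```

The design point: two decompositions with the same weight-space error can preserve `X @ W` very differently, and refitting in output space targets the quantity that actually matters for the downstream layers.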
Problem

Research questions and friction points this paper is trying to address.

binary quantization
large language models
residual binarization
feature co-adaptation
quantization-aware training
Innovation

Methods, ideas, or system contributions that make the work stand out.

residual binarization
quantization-aware training
feature co-adaptation
binary neural networks
efficient LLMs
Youngcheon You
Samsung Research, Seoul, Korea
Banseok Lee
Samsung Research, Seoul, Korea
Minseop Choi
Samsung Research, Seoul, Korea
Seonyoung Kim
Samsung Research, Seoul, Korea
Hyochan Chong
Samsung Research, Seoul, Korea
Changdong Kim
Samsung Research, Seoul, Korea
Youngmin Kim
School of Electronic and Electrical Engineering, Hongik University, Seoul, Korea
CAD & VLSI, DFM, Embedded Systems, Integrated Circuit Designs
Dongkyu Kim
Samsung Research, Seoul, Korea