🤖 AI Summary
Existing LLM binarization methods face a fundamental trade-off: post-training binarization is efficient but suffers from severe accuracy degradation, while training-aware approaches achieve better performance yet rely on full-precision latent weights—introducing inaccurate gradient approximations and substantial computational overhead. This work proposes the first end-to-end fine-tuning framework operating entirely within the Boolean domain, eliminating latent variables altogether. We parameterize model weights as multi-kernel Boolean variables, enabling exact gradient propagation and optimization directly in Boolean space. By co-designing multi-Boolean-kernel parameterization, latent-free optimization, and low-bit quantization, our method consistently outperforms state-of-the-art ultra-low-bit approaches across multiple mainstream LLMs. It achieves significant inference speedup and reduces memory footprint to less than 1/32 of the original model, marking the first demonstration of high-fidelity, high-efficiency Boolean-domain adaptation for large language models.
📝 Abstract
Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs). Existing methods fall into two categories: post-training binarization and finetuning with training-aware binarization. The first, while low in complexity, discards significant information from the original LLM, resulting in poor performance. The second relies heavily on full-precision latent weights to approximate gradients of binary weights, which is not only suboptimal but also introduces substantial complexity. In this paper, we introduce a novel framework that transforms LLM weights into multi-kernel Boolean parameters and, for the first time, finetunes them directly in the Boolean domain, eliminating the need for expensive latent weights. This significantly reduces complexity during both finetuning and inference. Through extensive and insightful experiments across a wide range of LLMs, we demonstrate that our method outperforms recent ultra-low-bit quantization and binarization methods.
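The abstract does not spell out the exact parameterization, but multi-kernel binary representations are commonly built by greedy residual binarization: a weight tensor is approximated as a sum of scaled sign kernels, W ≈ Σₖ αₖ·Bₖ with Bₖ ∈ {−1, +1}. The NumPy sketch below illustrates that general idea only; the function names are ours, and the paper's actual Boolean-domain parameterization and optimization may differ.

```python
import numpy as np

def multi_kernel_binarize(w, num_kernels=2):
    """Greedy residual binarization (illustrative sketch, not the paper's
    exact method): approximate w as sum_k alpha_k * B_k, B_k in {-1,+1}."""
    residual = w.astype(np.float64).copy()
    scales, kernels = [], []
    for _ in range(num_kernels):
        b = np.where(residual >= 0, 1.0, -1.0)  # binary kernel as +/-1
        alpha = np.abs(residual).mean()         # L2-optimal scale for sign(residual)
        scales.append(alpha)
        kernels.append(b)
        residual -= alpha * b                   # next kernel fits the residual
    return scales, kernels

def reconstruct(scales, kernels):
    return sum(a * b for a, b in zip(scales, kernels))

w = np.array([0.8, -0.3, 0.5, -0.9])
scales, kernels = multi_kernel_binarize(w, num_kernels=2)
w_hat = reconstruct(scales, kernels)
# Mean absolute error shrinks as more kernels are added.
err1 = np.abs(w - reconstruct(scales[:1], kernels[:1])).mean()
err2 = np.abs(w - w_hat).mean()
```

With only ±1 kernels and a handful of per-tensor scales to store, the memory cost approaches 1 bit per weight per kernel, which is consistent with the sub-1/32 footprint the summary reports.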