Highly Efficient and Effective LLMs with Multi-Boolean Architectures

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM binarization methods face a fundamental trade-off: post-training binarization is efficient but suffers from severe accuracy degradation, while training-aware approaches achieve better performance yet rely on full-precision latent weights—introducing inaccurate gradient approximations and substantial computational overhead. This work proposes the first end-to-end fine-tuning framework operating entirely within the Boolean domain, eliminating latent variables altogether. We parameterize model weights as multi-kernel Boolean variables, enabling exact gradient propagation and optimization directly in Boolean space. By co-designing multi-Boolean-kernel parameterization, latent-free optimization, and low-bit quantization, our method consistently outperforms state-of-the-art ultra-low-bit approaches across multiple mainstream LLMs. It achieves significant inference speedup and reduces memory footprint to less than 1/32 of the original model, marking the first demonstration of high-fidelity, high-efficiency Boolean-domain adaptation for large language models.
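The "less than 1/32 of the original model" memory claim follows from simple per-weight bit arithmetic: storing each weight as one or a few 1-bit Boolean kernels instead of a full-precision float. A minimal sketch of that arithmetic, assuming a hypothetical helper and ignoring the small overhead of per-kernel scales and metadata:

```python
def memory_ratio(orig_bits: int, num_kernels: int, kernel_bits: int = 1) -> float:
    """Per-weight storage ratio of a multi-kernel Boolean model
    versus the original precision.

    Illustrative assumption: each weight is replaced by `num_kernels`
    kernels of `kernel_bits` bits each; scale factors and other
    metadata are ignored.
    """
    return (num_kernels * kernel_bits) / orig_bits

# A single 1-bit kernel versus FP32 weights gives a 1/32 footprint;
# two 1-bit kernels versus FP16 weights give 1/8.
```

The exact ratio reported in the paper will also depend on which layers are binarized and how the scales are stored, so this is only the leading-order estimate.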

📝 Abstract
Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs). Existing work falls into two main approaches: post-training binarization and finetuning with training-aware binarization. The first approach, while having low complexity, discards significant information from the original LLMs, resulting in poor performance. The second approach, on the other hand, relies heavily on full-precision latent weights for gradient approximation of binary weights, which not only remains suboptimal but also introduces substantial complexity. In this paper, we introduce a novel framework that transforms LLM weights into multi-kernel Boolean parameters and, for the first time, finetunes them directly in the Boolean domain, eliminating the need for expensive latent weights. This significantly reduces complexity during both finetuning and inference. Through extensive and insightful experiments across a wide range of LLMs, we demonstrate that our method outperforms recent ultra-low-bit quantization and binarization methods.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM complexity via weight binarization without severe information loss
Eliminating reliance on full-precision latent weights during finetuning
Improving performance over ultra-low-bit quantization and binarization methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-kernel Boolean parameters for LLMs
Direct finetuning in Boolean domain
Eliminates need for latent weights
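To make the multi-kernel idea concrete: a weight tensor W is approximated as a sum of scaled sign (Boolean) kernels, W ≈ Σₖ αₖ Bₖ with Bₖ ∈ {−1, +1}. The sketch below uses greedy residual binarization, where each kernel binarizes the residual left by the previous ones; this is a standard construction for illustration only, not the paper's exact parameterization or its latent-free Boolean optimizer.

```python
import numpy as np

def multi_kernel_boolean(W, num_kernels=2):
    """Approximate W as sum_k alpha_k * B_k with B_k in {-1, +1}.

    Greedy residual binarization (illustrative sketch): kernel k
    takes the sign of the residual, with the scale alpha_k set to
    the mean absolute residual, which minimizes the L2 error for a
    per-tensor sign approximation.
    """
    residual = W.astype(np.float64)
    scales, kernels = [], []
    for _ in range(num_kernels):
        B = np.where(residual >= 0, 1.0, -1.0)  # Boolean (sign) kernel
        alpha = np.abs(residual).mean()         # best per-tensor scale
        scales.append(alpha)
        kernels.append(B)
        residual = residual - alpha * B         # binarize what remains
    return scales, kernels

def reconstruct(scales, kernels):
    """Rebuild the dense approximation sum_k alpha_k * B_k."""
    return sum(a * B for a, B in zip(scales, kernels))
```

Adding kernels monotonically shrinks the approximation error at a cost of one extra bit per weight per kernel, which is the efficiency/fidelity dial the paper's multi-Boolean architecture exposes.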
Ba-Hien Tran
Huawei Paris Research Center
Bayesian Inference · Machine Learning · Generative Models · Deep Learning · Efficient AI

Van Minh Nguyen
Mathematical and Algorithmic Sciences Laboratory, Huawei Paris Research Center