🤖 AI Summary
Existing LLM binarization methods face a fundamental trade-off: post-training binarization is efficient but suffers from severe accuracy degradation, while training-aware approaches achieve better performance yet rely on full-precision latent weights—introducing inaccurate gradient approximations and substantial computational overhead. This work proposes the first end-to-end fine-tuning framework operating entirely within the Boolean domain, eliminating latent variables altogether. We parameterize model weights as multi-kernel Boolean variables, enabling exact gradient propagation and optimization directly in Boolean space. By co-designing multi-Boolean-kernel parameterization, latent-free optimization, and low-bit quantization, our method consistently outperforms state-of-the-art ultra-low-bit approaches across multiple mainstream LLMs. It achieves significant inference speedup and reduces memory footprint to less than 1/32 of the original model, marking the first demonstration of high-fidelity, high-efficiency Boolean-domain adaptation for large language models.
📝 Abstract
Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs). Existing methods fall into two categories: post-training binarization and finetuning with training-aware binarization. The first, while low in complexity, discards significant information from the original LLM, resulting in poor performance. The second relies heavily on full-precision latent weights to approximate gradients of binary weights, which is not only suboptimal but also introduces substantial complexity. In this paper, we introduce a novel framework that transforms LLM weights into multi-kernel Boolean parameters and, for the first time, finetunes them directly in the Boolean domain, eliminating the need for expensive latent weights. This significantly reduces complexity during both finetuning and inference. Through extensive and insightful experiments across a wide range of LLMs, we demonstrate that our method outperforms recent ultra-low-bit quantization and binarization methods.
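The abstract does not spell out the exact parameterization, but multi-kernel binary representations are commonly built by greedy residual binarization: a weight tensor is approximated as a sum of scaled sign kernels, W ≈ Σₖ αₖ·Bₖ with Bₖ ∈ {−1, +1}. The NumPy sketch below illustrates that general idea only; the function names are ours, and the paper's actual Boolean-domain parameterization and optimization may differ.

```python
import numpy as np

def multi_kernel_binarize(w, num_kernels=2):
    """Greedy residual binarization (illustrative sketch, not the paper's
    exact method): approximate w as sum_k alpha_k * B_k, B_k in {-1,+1}."""
    residual = w.astype(np.float64).copy()
    scales, kernels = [], []
    for _ in range(num_kernels):
        b = np.where(residual >= 0, 1.0, -1.0)  # binary kernel as +/-1
        alpha = np.abs(residual).mean()         # L2-optimal scale for sign(residual)
        scales.append(alpha)
        kernels.append(b)
        residual -= alpha * b                   # next kernel fits the residual
    return scales, kernels

def reconstruct(scales, kernels):
    return sum(a * b for a, b in zip(scales, kernels))

w = np.array([0.8, -0.3, 0.5, -0.9])
scales, kernels = multi_kernel_binarize(w, num_kernels=2)
w_hat = reconstruct(scales, kernels)
# Mean absolute error shrinks as more kernels are added.
err1 = np.abs(w - reconstruct(scales[:1], kernels[:1])).mean()
err2 = np.abs(w - w_hat).mean()
```

With only ±1 kernels and a handful of per-tensor scales to store, the memory cost approaches 1 bit per weight per kernel, which is consistent with the sub-1/32 footprint the summary reports.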