🤖 AI Summary
When deploying large language models (LLMs) on resource-constrained devices, naively combining binarization with semi-structured pruning leads to severe accuracy degradation. To address this, this paper proposes PBS$^2$P, a progressive co-compression framework. Its core contributions are: (1) Stepwise semi-structured Pruning with Binarization Optimization (SPBO), which jointly controls the error introduced by pruning and binarization; (2) a Coarse-to-Fine Search (CFS) strategy that selects pruning elements more effectively; and (3) a hardware-friendly semi-structured sparsity pattern that preserves deployment efficiency. Evaluated across multiple LLM families and standard benchmarks, the method significantly outperforms state-of-the-art binary post-training quantization (PTQ) approaches, achieving higher accuracy while enabling efficient hardware deployment. Notably, it demonstrates that the total error after co-compression can be *lower* than that of pure binarization, validating the synergistic benefits of coordinated pruning and quantization.
📝 Abstract
Large language models (LLMs) have achieved remarkable success in natural language processing tasks, but their high computational and memory demands pose challenges for deployment on resource-constrained devices. Binarization, an efficient compression method that reduces model weights to just 1 bit, significantly lowers both computational and memory requirements. Even so, binarized LLMs still contain redundancy that can be compressed further. Semi-structured pruning offers a promising way to do so, striking a better trade-off between model performance and hardware efficiency. However, simply combining binarization with semi-structured pruning can cause a significant performance drop. To address this issue, we propose a Progressive Binarization with Semi-Structured Pruning (PBS$^2$P) method for LLM compression. We first propose a Stepwise semi-structured Pruning with Binarization Optimization (SPBO) strategy. It significantly reduces the total error caused by pruning and binarization, even below that of the no-pruning scenario. Furthermore, we design a Coarse-to-Fine Search (CFS) method to select pruning elements more effectively. Extensive experiments demonstrate that PBS$^2$P achieves superior accuracy across various LLM families and evaluation metrics, noticeably outperforming state-of-the-art (SOTA) binary PTQ methods. The code and models will be available at https://github.com/XIANGLONGYAN/PBS2P.
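To make the two background operations concrete, below is a minimal NumPy sketch of (a) standard 1-bit weight binarization with a per-row scaling factor (XNOR-Net style, $\alpha \cdot \mathrm{sign}(W)$) and (b) a hardware-friendly 2:4 semi-structured sparsity mask (keep the 2 largest-magnitude weights in every group of 4). This illustrates the naive prune-then-binarize combination the paper improves upon; it is not the PBS$^2$P method itself, and all function names are illustrative.

```python
import numpy as np

def binarize(W):
    """1-bit binarization with a per-row scale: alpha * sign(W).
    This is the common baseline scheme, not PBS^2P itself."""
    alpha = np.abs(W).mean(axis=1, keepdims=True)  # per-output-row scale
    return alpha * np.sign(W)

def semi_structured_mask(W, n=2, m=4):
    """N:M semi-structured sparsity: in each group of m consecutive
    weights, keep the n with the largest magnitude."""
    rows, cols = W.shape
    groups = np.abs(W).reshape(rows, cols // m, m)
    order = np.argsort(groups, axis=-1)          # ascending by magnitude
    mask = np.ones_like(groups)
    # zero out the (m - n) smallest-magnitude weights in every group
    np.put_along_axis(mask, order[..., : m - n], 0.0, axis=-1)
    return mask.reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
mask = semi_structured_mask(W)
# Naive combination: prune first, then binarize the survivors.
# Note the scale is computed over the masked matrix (zeros included),
# one of the error sources a joint optimization can address.
W_compressed = binarize(W * mask) * mask
```

A run of this sketch shows that each row keeps exactly 4 of its 8 weights, and every surviving weight collapses to one of two values per row ($\pm\alpha$), which is what makes the naive combination so lossy without an error-aware optimization such as SPBO.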