🤖 AI Summary
Deploying quantized large language models (LLMs) on memory-constrained devices remains challenging due to their excessive memory footprint. To address this, we propose a two-stage compression framework applied after quantization. Our key contributions are: (1) compression-aware quantization, which enhances weight compressibility by rescaling model parameters before quantization; (2) a two-stage compression pipeline that combines structured pruning with sparse-coding optimization; and (3) a speed-adaptive decompression scheduling mechanism that dynamically balances decompression overhead against inference latency. Experiments across mainstream LLMs demonstrate an average 2.2× model compression ratio, a 40% reduction in memory footprint, negligible accuracy degradation (<0.3% absolute drop in evaluation metrics), and no statistically significant slowdown in end-to-end inference.
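
The rescaling idea can be pictured with a toy experiment. The sketch below is a minimal Python illustration, assuming a simple per-tensor rescaling rule (enlarging the quantization scale so codes cluster on fewer symbols); the paper's actual rescaling scheme, bit width, and entropy coder are not specified here, and `quantize`/`compressed_size` are hypothetical helpers.

```python
import zlib

import numpy as np


def quantize(w, scale):
    """Uniform symmetric quantization to int8 codes with a given scale."""
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)


def compressed_size(q):
    """Proxy for entropy-coded storage: zlib over the raw int8 bytes."""
    return len(zlib.compress(q.tobytes(), 9))


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)  # toy weight matrix

# Baseline: per-tensor scale from the max-abs weight (standard symmetric scheme).
base_scale = float(np.abs(w).max()) / 127.0
q_base = quantize(w, base_scale)

# Compression-aware variant (an assumed rescaling rule, not the paper's exact one):
# enlarging the scale clusters codes on fewer symbols, lowering entropy and
# shrinking the compressed size, at the cost of some reconstruction error.
aware_scale = 2.0 * base_scale
q_aware = quantize(w, aware_scale)

for name, q, s in [("baseline", q_base, base_scale), ("rescaled", q_aware, aware_scale)]:
    mse = float(np.mean((q.astype(np.float32) * s - w) ** 2))
    print(f"{name}: zlib bytes = {compressed_size(q)}, reconstruction MSE = {mse:.3e}")
```

On Gaussian-like weights, the rescaled codes compress noticeably better under zlib while incurring a higher reconstruction error, which is the compressibility/accuracy trade-off any compression-aware scheme must balance.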
📝 Abstract
Large language models (LLMs) exhibit excellent performance on various tasks. However, their memory requirements pose a significant challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework that further compresses LLMs after quantization, achieving about a 2.2× compression ratio. We first propose a compression-aware quantization that enhances model weight compressibility by rescaling the model parameters before quantization, followed by a pruning method that improves it further. Building on this, we observe that decompression can become a bottleneck in practical deployments. We give a detailed analysis of the trade-off between memory usage and latency introduced by the proposed method, and propose a speed-adaptive method to overcome it. Experimental results show that inference with the compressed model achieves a 40% reduction in memory size with negligible loss in accuracy and inference speed.
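
To make the speed-adaptive idea concrete, here is a minimal sketch of one plausible scheduling policy: keep a layer compressed only when its measured decompression time fits within a latency budget, and hold it decompressed otherwise. The class name `SpeedAdaptiveStore`, the zlib codec, and the per-layer budget are all illustrative assumptions, not the paper's actual mechanism.

```python
import time
import zlib

import numpy as np


class SpeedAdaptiveStore:
    """Toy per-layer weight store: a layer stays compressed only if decompressing
    it fits within a latency budget; otherwise it is held decompressed in memory.
    (Illustrative policy; the paper's scheduler is not specified here.)"""

    def __init__(self, layers, budget_s):
        self.entries = {}
        for name, w in layers.items():
            blob = zlib.compress(w.tobytes(), 6)
            t0 = time.perf_counter()
            zlib.decompress(blob)  # measure one-shot decompression cost
            cost = time.perf_counter() - t0
            if cost <= budget_s:
                self.entries[name] = ("zip", blob, w.shape, w.dtype)
            else:
                self.entries[name] = ("raw", w, w.shape, w.dtype)

    def get(self, name):
        kind, payload, shape, dtype = self.entries[name]
        if kind == "raw":
            return payload
        return np.frombuffer(zlib.decompress(payload), dtype=dtype).reshape(shape)


rng = np.random.default_rng(0)
layers = {f"layer_{i}": rng.integers(-8, 8, size=(512, 512)).astype(np.int8)
          for i in range(4)}
store = SpeedAdaptiveStore(layers, budget_s=2e-3)  # 2 ms per-layer budget (assumed)
w0 = store.get("layer_0")  # decompressed on demand if stored compressed
print(w0.shape, w0.dtype)
```

Under this kind of policy, layers whose decompression is cheap stay compressed and contribute to the memory savings, while slow-to-decompress layers are kept resident so that end-to-end inference latency is not degraded.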