When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

📅 2025-02-21
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying quantized large language models (LLMs) on memory-constrained devices remains challenging due to their memory footprint. To address this, we propose a two-stage post-quantization compression framework. Our key contributions are: (1) a compression-aware quantization that enhances weight compressibility by rescaling the model parameters before quantization; (2) a pruning method that further reduces model size; and (3) a speed-adaptive decompression scheduling mechanism that dynamically balances decompression overhead against inference latency. Experiments across mainstream LLMs demonstrate an average 2.2× compression ratio, a 40% reduction in memory footprint, negligible accuracy degradation, and no significant loss in end-to-end inference speed.

📝 Abstract
Large language models (LLMs) exhibit excellent performance on various tasks. However, their memory requirements present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework that further compresses LLMs after quantization, achieving about a 2.2x compression ratio. A compression-aware quantization is first proposed to enhance model weight compressibility by rescaling the model parameters before quantization, followed by a pruning method for further improvement. Building on this, we observe that decompression can become a bottleneck in practical scenarios. We give a detailed analysis of the trade-off between memory usage and latency introduced by the proposed method, and propose a speed-adaptive method to overcome it. Experimental results show that inference with the compressed model achieves a 40% reduction in memory footprint with negligible loss in accuracy and inference speed.
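The core intuition behind compression-aware quantization — that how you scale weights before quantizing changes how well the resulting integer codes compress losslessly — can be illustrated with a minimal sketch. This is not the paper's method: the rescaling rule, the 4-bit grid, and the use of zlib here are all assumptions chosen for illustration.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=4096).astype(np.float32)

def quantize(w, scale):
    # Symmetric 4-bit quantization: integer codes clipped to [-7, 7].
    return np.clip(np.round(w / scale), -7, 7).astype(np.int8)

# Standard scale: cover the full dynamic range of the weights.
scale_fine = np.abs(weights).max() / 7.0
codes_fine = quantize(weights, scale_fine)

# Illustrative "compression-aware" rescaling: a coarser scale concentrates
# codes on fewer distinct values (lower entropy), which a lossless coder
# exploits, at the cost of some extra quantization error.
scale_coarse = 2.0 * scale_fine
codes_coarse = quantize(weights, scale_coarse)

size_fine = len(zlib.compress(codes_fine.tobytes()))
size_coarse = len(zlib.compress(codes_coarse.tobytes()))
print(size_fine, size_coarse)  # the coarser codes compress to fewer bytes
```

The sketch shows the trade-off the abstract describes: rescaling trades a little precision for codes that a generic lossless compressor shrinks noticeably further.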
Problem

Research questions and friction points this paper is trying to address.

Memory-efficient compression for LLMs
Compression-aware quantization and pruning
Balancing memory usage and latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compression-aware quantization enhances compressibility
Pruning method further reduces model size
Speed-adaptive method balances memory and latency
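The speed-adaptive idea can be sketched as a simple per-layer policy: keep a layer's weights compressed only when its estimated decompression time fits inside the latency slack available for that layer. The layer names, costs, and budget below are hypothetical; the paper's actual scheduler is not specified here.

```python
def plan_layers(layers, budget_ms):
    """Decide, per layer, whether to store weights compressed or raw.

    layers: list of (name, decompress_ms) pairs;
    budget_ms: per-layer latency slack available to hide decompression in.
    """
    return {
        name: "compressed" if decompress_ms <= budget_ms else "raw"
        for name, decompress_ms in layers
    }

# Hypothetical per-layer decompression cost estimates (milliseconds).
layers = [("attn.q_proj", 0.4), ("mlp.up_proj", 2.5)]
print(plan_layers(layers, budget_ms=1.0))
# → {'attn.q_proj': 'compressed', 'mlp.up_proj': 'raw'}
```

A real scheduler would measure decompression throughput on the target device at runtime and adapt the budget dynamically, but the memory-vs-latency decision reduces to a comparison like this one.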