ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

📅 2026-03-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the memory and bandwidth bottlenecks in large language model inference caused by massive model sizes, as well as the performance degradation induced by existing lossless compression methods on GPUs. The authors propose a GPU hardware-aware lossless compression framework featuring two key innovations: Tensor Core–aware Triple Bitmap Encoding (TCA-TBE), which enables constant-time parallel decoding, and a fused decompression-and-GEMM kernel, ZipGEMM, that performs computations directly in registers to eliminate intermediate buffers. This approach is the first to simultaneously achieve storage compression and inference acceleration without sacrificing precision, reducing model size by up to 30%, delivering up to 2.21× speedup over cuBLAS at the kernel level, and achieving an average end-to-end inference speedup of 1.22× compared to vLLM.

πŸ“ Abstract
Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This "load-compressed, compute-decompressed" design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21x kernel-level speedup over NVIDIA's cuBLAS, and expedites end-to-end inference by an average of 1.22x over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.
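To illustrate why a fixed-length bitmap format decodes in parallel while variable-length entropy bitstreams do not, here is a toy Python sketch of bitmap-plus-exception encoding for the high-byte plane of bf16 weights. The byte-plane split, the `compress`/`decompress` helpers, and the skew assumption are illustrative only; this is not ZipServ's actual TCA-TBE layout, which the abstract does not specify in detail.

```python
import numpy as np

# Toy sketch (an assumption for illustration, not the paper's TCA-TBE format):
# trained LLM weights in bf16 concentrate on a few exponent values, so the
# high byte of each weight is highly skewed. We mark elements whose high byte
# equals the most frequent value in a one-bit-per-element bitmap and store
# explicit bytes only for the exceptions. Every field has a fixed,
# position-computable length, so independent chunks can be decoded in
# parallel -- unlike a variable-length entropy-coded bitstream.

def compress(hi_bytes: np.ndarray):
    """Split the high-byte plane into (default, bitmap, exceptions)."""
    vals, counts = np.unique(hi_bytes, return_counts=True)
    default = vals[np.argmax(counts)]   # implicit value for most elements
    bitmap = hi_bytes != default        # fixed-length: one bit per element
    exceptions = hi_bytes[bitmap]       # explicit bytes for the rest
    return default, bitmap, exceptions

def decompress(default, bitmap, exceptions) -> np.ndarray:
    """Bit-exact reconstruction of the high-byte plane."""
    hi = np.full(bitmap.shape, default, dtype=np.uint8)
    hi[bitmap] = exceptions
    return hi

# Skewed demo data: ~90% of high bytes share one exponent value.
rng = np.random.default_rng(0)
hi = np.where(rng.random(10_000) < 0.9, np.uint8(0x3F),
              rng.integers(0, 256, 10_000, dtype=np.uint8)).astype(np.uint8)

default, bitmap, exc = compress(hi)
assert np.array_equal(decompress(default, bitmap, exc), hi)  # lossless
packed_bits = bitmap.size + 8 * exc.size   # bitmap bits + exception bytes
assert packed_bits < 8 * hi.size           # smaller than the raw byte plane
```

The key property mirrored here is fixed-length addressability: the bitmap's size depends only on the element count, so a GPU thread block can locate its slice of the compressed stream without scanning a serial bitstream, which is what lets decompression fuse into the GEMM.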
Problem

Research questions and friction points this paper is trying to address.

lossless compression
LLM inference
memory bottleneck
GPU architecture
inference slowdown
Innovation

Methods, ideas, or system contributions that make the work stand out.

lossless compression
LLM inference
Tensor Core
fixed-length encoding
fused kernel