🤖 AI Summary
To address the memory and bandwidth bottlenecks that hinder large language model (LLM) deployment on edge devices, this paper proposes the first end-to-end lossless compression framework tailored for LLMs, enabling compressed storage and direct inference across the full stack: cloud, disk, main memory, and on-chip caches. Methodologically, it integrates weight-distribution-adaptive Huffman coding, support for direct computation in the compressed domain, and memory- and bandwidth-aware weight repartitioning. Key contributions include: (i) strict preservation of original model behavior with zero precision loss; (ii) substantial reduction in weight-loading bandwidth and on-chip storage footprint; (iii) improved inference latency and energy efficiency; and (iv) efficient deployment of larger-scale LLMs on resource-constrained edge hardware. Experimental results demonstrate exact accuracy retention while achieving up to 2.1× bandwidth savings and 1.8× on-chip memory reduction across diverse LLM architectures and edge platforms.
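To make the core mechanism concrete, here is a minimal, illustrative sketch of lossless Huffman coding applied to quantized weight symbols. This is not the paper's implementation (the paper adapts the code to the weight distribution and supports compressed-domain compute); it only demonstrates the basic property the framework relies on: a skewed symbol distribution compresses below its fixed-width encoding, and decoding recovers the weights exactly.

```python
import heapq
from collections import Counter

def build_huffman_codes(symbols):
    """Build a prefix-free Huffman code from observed symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreaker, tuple-of-symbols-in-subtree).
    heap = [(f, i, (s,)) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    codes = {s: "" for s in freq}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend a bit for every symbol under each merged subtree.
        for s in left:
            codes[s] = "0" + codes[s]
        for s in right:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (f1 + f2, tiebreak, left + right))
        tiebreak += 1
    return codes

def encode(symbols, codes):
    return "".join(codes[s] for s in symbols)

def decode(bits, codes):
    rev = {v: k for k, v in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in rev:  # prefix-free, so the first match is the symbol
            out.append(rev[cur])
            cur = ""
    return out

# Toy 8-bit "weight" symbols with a skewed distribution (small magnitudes
# dominate, as is typical for LLM weight values).
weights = [0, 0, 0, 1, 1, 2, 3, 0, 1, 0]
codes = build_huffman_codes(weights)
bits = encode(weights, codes)
assert decode(bits, codes) == weights   # lossless round trip
assert len(bits) < 8 * len(weights)     # beats fixed 8-bit storage
```

Because the code is prefix-free, decoding is a single left-to-right scan, which is what makes streaming decompression of weights feasible in hardware.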
📝 Abstract
As they become more capable, large language models (LLMs) have continued to rapidly increase in size. This has exacerbated the difficulty of running state-of-the-art LLMs on small, edge devices. Standard approaches address this problem through lossy compression techniques such as quantization or pruning. However, such techniques have been shown to change model behavior in unpredictable ways. We propose Huff-LLM, an *end-to-end, lossless* model compression method that lets users store LLM weights in compressed format *everywhere* -- cloud, disk, main memory, and even in on-chip memory/buffers. This allows us not only to load larger models into main memory, but also to reduce the bandwidth required to load weights on chip and to make more efficient use of on-chip weight buffers. In addition to the memory savings achieved via compression, we also show latency and energy efficiency improvements when performing inference with the compressed model.
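The abstract's claim that weights can stay compressed "even in on-chip memory/buffers" implies decoding symbols on the fly as they are consumed. The sketch below is a hypothetical illustration of that idea for a single dot product: the weight vector is walked in its Huffman-encoded form and each weight is multiplied into the accumulator the moment it is decoded, so the uncompressed vector never needs to be materialized. The function name and the toy code table are illustrative, not the paper's actual kernel.

```python
def dot_compressed(bits, rev_codes, activations):
    """Dot product of activations with a Huffman-encoded weight vector.

    bits      -- the encoded weight vector as a string of '0'/'1'
    rev_codes -- map from prefix-free codeword to weight value
    """
    acc = 0.0
    cur = ""
    i = 0  # index of the next activation to consume
    for b in bits:
        cur += b
        if cur in rev_codes:          # codeword complete: one weight decoded
            acc += rev_codes[cur] * activations[i]  # multiply immediately
            cur = ""
            i += 1
    return acc

# Toy prefix-free code table and an encoded weight vector [1.0, -1.0, 2.0].
rev_codes = {"0": 1.0, "10": -1.0, "110": 0.5, "111": 2.0}
bits = "0" + "10" + "111"
acts = [3.0, 4.0, 5.0]
assert dot_compressed(bits, rev_codes, acts) == 1.0*3.0 + (-1.0)*4.0 + 2.0*5.0
```

A hardware analogue of this loop is what turns the compression into a bandwidth win: the datapath reads fewer bits per weight from the buffer while producing the same arithmetic result bit-for-bit.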