Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference

πŸ“… 2025-02-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address memory and bandwidth bottlenecks hindering large language model (LLM) deployment on edge devices, this paper proposes the first end-to-end lossless compression framework tailored for LLMs, enabling compressed storage and direct inference across the full stackβ€”cloud, disk, main memory, and on-chip caches. Methodologically, it integrates weight-distribution-adaptive Huffman coding, support for direct computation in the compressed domain, and memory- and bandwidth-aware weight repartitioning. Key contributions include: (i) strict preservation of original model behavior with zero precision loss; (ii) substantial reduction in weight loading bandwidth and on-chip storage footprint; (iii) improved inference latency and energy efficiency; and (iv) enabling efficient deployment of larger-scale LLMs on resource-constrained edge hardware. Experimental results demonstrate consistent accuracy retention while achieving up to 2.1Γ— bandwidth savings and 1.8Γ— on-chip memory reduction across diverse LLM architectures and edge platforms.
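The central ingredient, Huffman coding adapted to the distribution of weight values, can be illustrated with a small sketch. This is not the paper's implementation: the toy byte-valued "weight" array and the `huffman_code` helper are assumptions for illustration, standing in for one layer's quantized weight representation.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code table {symbol: bitstring} from symbol frequencies."""
    # Heap entries: (frequency, tiebreak, tree); tree is a symbol or (left, right) pair.
    heap = [(f, i, s) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single symbol gets code "0"
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two least-frequent subtrees, as in standard Huffman construction.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

# Toy "weights" with a skewed value distribution; skew is what makes
# entropy coding pay off on real LLM weight representations.
weights = bytes([0] * 500 + [1] * 250 + [2] * 125 + [3] * 125)
codes = huffman_code(Counter(weights))
compressed_bits = sum(len(codes[b]) for b in weights)
ratio = (len(weights) * 8) / compressed_bits
print(f"compression ratio: {ratio:.2f}x")
```

Because Huffman codes are prefix-free, a hardware decoder can consume the compressed bitstream symbol by symbol, which is what makes keeping weights compressed in on-chip buffers plausible.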

πŸ“ Abstract
As they become more capable, large language models (LLMs) have continued to rapidly increase in size. This has exacerbated the difficulty of running state-of-the-art LLMs on small, edge devices. Standard techniques advocate solving this problem through lossy compression such as quantization or pruning. However, such techniques have been shown to change model behavior in unpredictable ways. We propose Huff-LLM, an *end-to-end, lossless* model compression method that lets users store LLM weights in compressed format *everywhere* -- cloud, disk, main memory, and even in on-chip memory/buffers. This allows us not only to load larger models in main memory, but also reduces the bandwidth required to load weights on chip and makes more efficient use of on-chip weight buffers. In addition to the memory savings achieved via compression, we also show latency and energy-efficiency improvements when performing inference with the compressed model.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Resource-efficient Deployment
Information Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Huff-LLM
Lossless Compression
Large Language Models
Patrick Yubeaton
Department of Electrical and Computer Engineering, New York University, NY, USA
Tareq Mahmoud
Department of Computer Science and Engineering, University of Notre Dame, IN, USA
Shehab Naga
Department of Computer Science and Engineering, University of Notre Dame, IN, USA
Pooria Taheri
University of Notre Dame
Tianhua Xia
New York University
Arun George
Department of Computer Science and Engineering, University of Notre Dame, IN, USA
Yasmein Khalil
Department of Computer Science and Engineering, University of Notre Dame, IN, USA
Sai Qian Zhang
New York University
Chinmay Hegde
New York University
Siddharth Garg
Institute Associate Professor, New York University
Siddharth Joshi
Department of Computer Science and Engineering, University of Notre Dame, IN, USA