QStore: Quantization-Aware Compressed Model Storage

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Storing large language models (LLMs) in multiple precisions—e.g., both high-precision (BF16) and low-precision (INT8) variants—incurs significant storage redundancy; storing only high-precision weights, conversely, imposes runtime overhead from dynamic quantization. Method: This paper proposes a unified lossless compression format and introduces a quantization-aware joint storage paradigm: using a low-precision model as the base, it stores only lightweight residual encodings that enable zero-redundancy co-reconstruction of both high- and low-precision parameters. The approach comprises residual quantization encoding, alignment-aware compact serialization, and efficient decoding logic. Results: Experiments show that the method reduces storage footprint by up to 55% (up to 2.2× compression) versus conventional separate storage, accelerates model saving and loading by up to 1.7× and 1.8×, respectively, and incurs no additional inference latency.
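The core idea—keeping the low-precision model as the base and storing only a residual that losslessly recovers the high-precision weights—can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual format: it uses float32 in place of BF16, a simple symmetric per-tensor INT8 scheme, and a bitwise XOR residual (the XOR of the original weights against the dequantized approximation). All function names here are hypothetical; in practice the residual would additionally be entropy-coded, since it is mostly near-zero bits when the approximation is close.

```python
import numpy as np

def encode(w_fp32: np.ndarray, scale: float):
    """Split full-precision weights into an INT8 base + lossless residual."""
    # Low-precision base: symmetric per-tensor INT8 quantization.
    q = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    # Dequantized approximation in the same float format as the original.
    approx = q.astype(np.float32) * scale
    # Residual: bitwise XOR of original and approximation bit patterns.
    # XOR-ing the residual back onto the approximation is exactly invertible.
    residual = w_fp32.view(np.uint32) ^ approx.view(np.uint32)
    return q, residual

def decode_low(q: np.ndarray, scale: float) -> np.ndarray:
    """Low-precision path: just dequantize the stored INT8 base."""
    return q.astype(np.float32) * scale

def decode_high(q: np.ndarray, residual: np.ndarray, scale: float) -> np.ndarray:
    """High-precision path: merge the INT8 base with the residual."""
    approx = q.astype(np.float32) * scale
    return (approx.view(np.uint32) ^ residual).view(np.float32)
```

Because the residual is defined bitwise rather than arithmetically, reconstruction is exact regardless of rounding behavior, which mirrors the lossless guarantee the summary describes; only the encoding of the residual stream needs to be compact.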

📝 Abstract
Modern applications commonly leverage large, multi-modal foundation models. These applications often feature complex workflows that demand the storage and usage of similar models in multiple precisions. A straightforward approach is to maintain a separate file for each model precision (e.g., INT8, BF16), which is indeed the approach taken by many model providers such as HuggingFace and Ollama. However, this approach incurs excessive storage costs, since a higher-precision model (e.g., BF16) is a strict superset of a lower-precision model (e.g., INT8) in terms of information. Unfortunately, simply maintaining only the higher-precision model and requiring every user to dynamically convert the model precision is not desirable, because every user of lower-precision models must pay the cost of model download and precision conversion. In this paper, we present QStore, a unified, lossless compression format for simultaneously storing a model in two (high and low) precisions efficiently. Instead of storing low-precision and high-precision models separately, QStore stores the low-precision model and only the residual information needed to reconstruct the high-precision model. The size of the residual information is significantly smaller than that of the original high-precision model, thus achieving high savings in storage cost. Moreover, QStore does not compromise the speed of model loading. Low-precision models can be loaded quickly, just as before, and high-precision models can be reconstructed efficiently in memory by merging the low-precision data and the residual with QStore's lightweight decoding logic. We evaluate QStore for compressing multiple precisions of popular foundation models, and show that QStore reduces the overall storage footprint by up to 2.2x (to 45% of the original size) while enabling up to 1.7x and 1.8x faster model saving and loading versus existing approaches.
Problem

Research questions and friction points this paper is trying to address.

Reduces storage costs for multi-precision models
Enables efficient model loading without conversion delays
Unifies high and low precision models in one format
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified lossless compression for multi-precision models
Stores low-precision model plus residual for reconstruction
Lightweight decoding enables efficient high-precision loading