Towards Efficient LLM Storage Reduction via Tensor Deduplication and Delta Compression

📅 2025-04-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from rapidly escalating storage costs because model hubs host many redundant fine-tuned variants, while existing deduplication and compression techniques are mutually incompatible and ill-suited to LLM-specific characteristics. Method: This paper introduces zLLM, the first holistic storage-reduction framework for LLMs, which jointly optimizes deduplication and compression. It uncovers structural sparsity and byte-level similarity in parameter differences across LLM families, and combines tensor-level content-hashing deduplication with the model-aware, XOR-based BitX delta compressor so that the two techniques yield complementary gains. Contribution/Results: Leveraging LLM family clustering, lossless delta encoding, and a unified pipeline, zLLM achieves a 49.5% overall storage reduction on the full Hugging Face public model repository, surpassing state-of-the-art methods by over 20 percentage points, while supporting millisecond-scale decompression and seamless integration into inference pipelines.

📝 Abstract
Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and compression -- are either LLM-oblivious or not compatible with each other, limiting data reduction effectiveness. Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication offers strong synergy with model-aware compressors. Building on these insights, we present BitX, an effective, fast, lossless delta compression algorithm that compresses XORed redundancy between fine-tuned and base LLMs. We build zLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, zLLM reduces model storage consumption by 49.5 percent, over 20 percentage points more than state-of-the-art deduplication and compression designs.
Problem

Research questions and friction points this paper is trying to address.

Reduces storage for fine-tuned LLMs via deduplication and compression
Addresses inefficiency of existing LLM-oblivious storage techniques
Leverages structured parameter differences for delta compression
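The XOR-based delta idea behind BitX can be sketched in a few lines: XOR the raw bytes of a fine-tuned tensor against its base-model counterpart, then losslessly compress the delta stream, which is dominated by zero bytes when the weights are near-identical. This is a minimal illustration under assumed function names, not the paper's actual BitX implementation (which is model-aware and far more engineered); `zlib` stands in for whatever entropy coder BitX uses.

```python
import zlib
import numpy as np

def xor_delta_compress(base: np.ndarray, finetuned: np.ndarray) -> bytes:
    """Encode a fine-tuned tensor as a compressed XOR delta against its base."""
    assert base.shape == finetuned.shape and base.dtype == finetuned.dtype
    # XOR the raw byte views: unchanged weights produce zero bytes,
    # so the delta stream compresses far better than the raw tensor.
    delta = np.bitwise_xor(base.view(np.uint8), finetuned.view(np.uint8))
    return zlib.compress(delta.tobytes())

def xor_delta_decompress(base: np.ndarray, blob: bytes) -> np.ndarray:
    """Recover the fine-tuned tensor exactly (losslessly) from base + delta."""
    delta = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    restored = np.bitwise_xor(base.view(np.uint8).ravel(), delta)
    return restored.view(base.dtype).reshape(base.shape)
```

Because XOR is its own inverse at the byte level, the round trip is exact, which is what makes the scheme lossless rather than approximate.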
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergizes tensor-level deduplication with lossless compression
Uses BitX algorithm for efficient delta compression
Clusters LLM families to optimize storage reduction
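Tensor-level content-hashing deduplication, the other half of the pipeline, amounts to storing each unique tensor once under its content hash and keeping a per-model manifest of hashes. The sketch below is illustrative only (names like `dedup_tensors` are assumptions, and zLLM's real system handles chunking, clustering, and storage layout):

```python
import hashlib

def dedup_tensors(models: dict) -> tuple:
    """Deduplicate tensors across models by content hash.

    models: {model_name: {tensor_name: raw_bytes}}
    Returns (store, manifests) where store maps hash -> bytes (kept once)
    and manifests maps each model to {tensor_name: hash}.
    """
    store = {}      # content hash -> tensor bytes, stored exactly once
    manifests = {}  # model name -> {tensor name: content hash}
    for model, tensors in models.items():
        manifest = {}
        for tname, data in tensors.items():
            h = hashlib.sha256(data).hexdigest()
            store.setdefault(h, data)  # duplicates collapse to one copy
            manifest[tname] = h
        manifests[model] = manifest
    return store, manifests
```

Tensors shared verbatim between a base model and its fine-tuned variants (e.g. untouched embedding tables) collapse to a single stored copy, while tensors that differ slightly fall through to delta compression.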
Authors
Zirui Wang, University of Virginia
Tingfeng Lan, Department of Computer Science, University of Virginia
Zhaoyuan Su, University of Virginia
Juncheng Yang, Harvard University
Yue Cheng, University of Virginia