Lossless Compression for LLM Tensor Incremental Snapshots

📅 2025-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM) training places heavy pressure on I/O and network bandwidth because checkpoint tensors, often hundreds of gigabytes in size, must be written to persistent storage between epochs. To address this, we propose LMC, a lossless tensor compressor designed specifically for LLM checkpoints. LMC combines byte-grouping with Huffman encoding, parallelized across threads to achieve high compression ratios at low latency; the study also evaluates incremental delta compression against preceding checkpoints. In addition, the work presents a systematic characterization of how tensor compressibility evolves over the course of LLM training. On a 16-core system, LMC attains compression and decompression throughputs of 2.78 GiB/s and 3.76 GiB/s respectively, delivering more compression than BZ2 with an order-of-magnitude reduction in compression time and CPU overhead. This enables more frequent checkpointing, easing storage-capacity and network-bandwidth constraints in large-scale LLM training.
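
To make the byte-grouping idea concrete, here is a minimal NumPy sketch (an illustration of the general technique, not the authors' LMC code): it splits a stream of 2-byte values such as bfloat16 into per-byte planes, so the highly regular exponent bytes become contiguous and easier for an entropy coder to compress.

```python
import numpy as np

def byte_group(raw: bytes, word_size: int = 2) -> list[bytes]:
    # View the tensor as rows of word_size bytes; plane i then holds
    # byte i of every element (bfloat16 -> 2 planes).
    buf = np.frombuffer(raw, dtype=np.uint8).reshape(-1, word_size)
    return [buf[:, i].tobytes() for i in range(word_size)]

def byte_ungroup(planes: list[bytes]) -> bytes:
    # Inverse transform: re-interleave the planes element by element.
    cols = [np.frombuffer(p, dtype=np.uint8) for p in planes]
    return np.stack(cols, axis=1).tobytes()

# Round trip on 1 MiB of random data
data = np.random.default_rng(0).bytes(1 << 20)
assert byte_ungroup(byte_group(data)) == data
```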

📝 Abstract
During the training of Large Language Models (LLMs), tensor data is periodically "checkpointed" to persistent storage to allow recovery of work done in the event of failure. The volume of data that must be copied during each checkpoint, even when using reduced-precision representations such as bfloat16, often reaches hundreds of gigabytes. Furthermore, the data must be moved across a network and written to a storage system before the next epoch occurs. With a view to ultimately building an optimized checkpointing solution, this paper presents an experimental analysis of checkpoint data used to derive a design that maximizes the use of lossless compression to reduce the volume of data. We examine how tensor data and its compressibility evolve during model training and evaluate the efficacy of existing common off-the-shelf general-purpose compression engines combined with known data optimization techniques such as byte-grouping and incremental delta compression. Leveraging our analysis, we have built an effective compression solution, known as Language Model Compressor (LMC), which is based on byte-grouping and Huffman encoding. LMC offers more compression performance than the best alternative (BZ2), but with an order-of-magnitude reduction in the time needed to perform the compression. We show that a 16-core parallel implementation of LMC can attain compression and decompression throughput of 2.78 GiB/s and 3.76 GiB/s respectively. This increase in performance ultimately reduces the CPU resources needed and provides more time to copy the data to the storage system before the next epoch, thus allowing for higher-frequency checkpoints.
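
The incremental delta compression mentioned in the abstract can be sketched as byte-wise differencing of consecutive checkpoints; the XOR form below is one common variant and an assumption on our part, not necessarily the exact scheme evaluated in the paper. Because most weights change only slightly between checkpoints, the XOR stream is dominated by zero bytes, which a downstream compressor exploits.

```python
import numpy as np

def xor_delta(prev: bytes, curr: bytes) -> bytes:
    # Byte-wise XOR of two same-sized checkpoints; slowly changing
    # weights yield long runs of zeros in the high-order byte planes.
    a = np.frombuffer(prev, dtype=np.uint8)
    b = np.frombuffer(curr, dtype=np.uint8)
    return np.bitwise_xor(a, b).tobytes()

# XOR is self-inverse, so the same function restores the checkpoint:
# curr == xor_delta(prev, xor_delta(prev, curr))
```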
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM checkpoint data volume via lossless compression (see the back-of-envelope sizing sketch after this list)
Optimizing tensor data compression during model training
Improving checkpoint speed with efficient parallel compression
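
As a rough illustration of the data volumes behind these problems, the Python below uses standard mixed-precision Adam accounting (bf16 weights plus fp32 master weights and two fp32 optimizer moments, i.e. 14 bytes per parameter) for a hypothetical 70B-parameter model; the figures are illustrative assumptions, not measurements from the paper.

```python
# Hypothetical 70B-parameter model, common mixed-precision Adam accounting:
# bf16 weights (2 B) + fp32 master weights (4 B) + two fp32 moments (8 B).
params = 70e9
weights_gib = params * 2 / 2**30
full_state_gib = params * (2 + 4 + 8) / 2**30
print(f"weights only:        {weights_gib:.0f} GiB")      # ~130 GiB
print(f"full training state: {full_state_gib:.0f} GiB")   # ~913 GiB
```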
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses byte-grouping and Huffman encoding for compression (sketched after this list)
Achieves faster compression than BZ2 with LMC
Enables higher-frequency checkpoints via parallel processing
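
To make the Huffman-plus-parallelism combination concrete, here is a self-contained Python sketch: independent chunks are entropy-coded on separate cores. The chunking policy, function names, and the pure-Python coder are illustrative assumptions, not LMC's implementation, and will not approach the reported GiB/s throughput.

```python
import heapq, itertools
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def huffman_codes(data: bytes) -> dict[int, str]:
    # Build a Huffman code table for one byte stream.
    freq = Counter(data)
    tie = itertools.count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (a, b)))
    codes: dict[int, str] = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # degenerate one-symbol stream
    walk(heap[0][2])
    return codes

def encode_chunk(chunk: bytes) -> bytes:
    # A real container format would also store the code table and bit length.
    if not chunk:
        return b""
    codes = huffman_codes(chunk)
    bits = "".join(codes[b] for b in chunk)
    bits += "0" * (-len(bits) % 8)       # pad to a byte boundary
    return int(bits, 2).to_bytes(len(bits) // 8, "big")

def parallel_compress(data: bytes, workers: int = 16,
                      chunk_bytes: int = 64 << 20) -> list[bytes]:
    # Independent chunks let all 16 cores code simultaneously.
    chunks = [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_chunk, chunks))

# Usage (guard needed on spawn-based platforms; "ckpt.bin" is hypothetical):
# if __name__ == "__main__":
#     blobs = parallel_compress(open("ckpt.bin", "rb").read())
```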
Daniel Waddington
IBM Research, Almaden
multicore · persistent memory · high-performance distributed computing
Cornel Constantinescu
IBM Research, Almaden Research Center, CA, USA