🤖 AI Summary
Large language model (LLM) training places severe pressure on I/O and network bandwidth because checkpoint tensors, often hundreds of gigabytes in size, must be repeatedly written to storage, creating storage and transmission bottlenecks. To address this, we propose LMC, a lossless tensor compressor designed specifically for LLM checkpoints. LMC combines byte-level grouping, incremental differencing, and Huffman coding, accelerated with multi-threaded parallelism to achieve high compression ratios at low latency. Crucially, this work presents the first systematic characterization of how tensor compressibility evolves over the course of LLM training. Experimental evaluation on 16-core systems shows that LMC attains compression and decompression throughputs of 2.78 GiB/s and 3.76 GiB/s, respectively, significantly outperforming BZ2 while drastically reducing CPU overhead. This enables more frequent checkpointing, effectively alleviating storage-capacity and network-bandwidth constraints in large-scale LLM training.
📝 Abstract
During the training of Large Language Models (LLMs), tensor data is periodically "checkpointed" to persistent storage to allow recovery of work done in the event of failure. The volume of data that must be copied during each checkpoint, even when using reduced-precision representations such as bfloat16, often reaches hundreds of gigabytes. Furthermore, the data must be moved across a network and written to a storage system before the next epoch occurs. With a view to ultimately building an optimized checkpointing solution, this paper presents an experimental analysis of checkpoint data used to derive a design that maximizes the use of lossless compression to reduce the volume of data. We examine how tensor data and its compressibility evolve during model training and evaluate the efficacy of existing off-the-shelf general-purpose compression engines combined with known data optimization techniques such as byte-grouping and incremental delta compression. Leveraging our analysis, we have built an effective compression solution, known as Language Model Compressor (LMC), which is based on byte-grouping and Huffman encoding. LMC achieves a higher compression ratio than the best alternative (BZ2), but with an order-of-magnitude reduction in the time needed to perform the compression. We show that a 16-core parallel implementation of LMC can attain compression and decompression throughputs of 2.78 GiB/s and 3.76 GiB/s, respectively. This increase in performance ultimately reduces the CPU resources needed and provides more time to copy the data to the storage system before the next epoch, thus allowing for higher-frequency checkpoints.
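To make the byte-grouping idea concrete, here is a minimal sketch (not the paper's implementation) of the transformation for 2-byte values such as bfloat16: the i-th byte of every value is gathered into its own contiguous stream, so the highly repetitive exponent bytes cluster together and become easier for an entropy coder such as Huffman to exploit. The function names `byte_group` and `byte_ungroup` are illustrative, not from the paper.

```python
def byte_group(data: bytes, width: int = 2) -> bytes:
    """Regroup a buffer of fixed-width values so the i-th byte of every
    value is stored contiguously (stream i). For bfloat16 tensors this
    separates sign/exponent bytes from mantissa bytes."""
    assert len(data) % width == 0
    return b"".join(data[i::width] for i in range(width))

def byte_ungroup(data: bytes, width: int = 2) -> bytes:
    """Inverse of byte_group: re-interleave the per-position streams."""
    assert len(data) % width == 0
    n = len(data) // width  # number of original values
    streams = [data[i * n:(i + 1) * n] for i in range(width)]
    out = bytearray()
    for j in range(n):
        for s in streams:
            out.append(s[j])
    return bytes(out)
```

After grouping, a general-purpose or Huffman-based coder sees long runs of similar bytes in the exponent stream, which is what raises the compression ratio on checkpoint tensors.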