AI Summary
Training large language models (LLMs) with parameter counts exceeding aggregate multi-GPU memory capacity faces severe bottlenecks: existing asynchronous multi-level offloading techniques incur substantial I/O overhead, significantly slowing iteration throughput. Method: We propose a multi-level, concurrency-controlled offloading architecture that enables efficient caching and parallel transmission of optimizer states during both the backward propagation and parameter update phases. Our approach integrates host-memory–disk collaborative offloading, cache-aware optimizer state management, and asynchronous I/O scheduling to fully utilize remote storage bandwidth and mitigate I/O contention. Contribution/Results: Evaluated on a 280B-parameter model, our system achieves a 2.5× speedup in training iteration time over state-of-the-art systems. It effectively breaks the GPU memory barrier and substantially improves training efficiency under resource-constrained conditions.
Abstract
Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary, as LLM sizes grow faster than GPU memory capacity. To this end, state-of-the-art systems propose multi-tier offloading to host memory or disk. Despite advanced asynchronous multi-tier read/write strategies, such offloading places significant I/O overheads on the critical path of training, slowing down iterations. To address this, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed to optimize LLM training on resource-constrained setups by mitigating I/O bottlenecks. We make several key observations that drive the design of MLP-Offload: I/O overheads during the update phase dominate the iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion, mitigating I/O bottlenecks during the backward and update phases. Evaluations on models of up to 280B parameters show that MLP-Offload achieves 2.5× faster iterations compared to state-of-the-art LLM training runtimes.
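To make the core idea concrete, here is a minimal sketch of multi-path, concurrency-controlled offloading of optimizer-state shards across two tiers (host memory and disk). All names and the round-robin tier-placement policy are illustrative assumptions for this sketch, not MLP-Offload's actual API or scheduling policy; the real system also targets a third remote storage tier and overlaps offloading with backward/update computation.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

# Tier 1: host memory (a dict stands in for a pinned host-memory cache).
host_cache = {}
# Tier 2: disk (a temp directory stands in for local/remote storage).
disk_dir = tempfile.mkdtemp()
# Per-tier semaphores bound concurrent writers, limiting I/O contention
# when many shards are flushed at once (illustrative limits).
tier_sem = {"host": threading.Semaphore(4), "disk": threading.Semaphore(2)}

def offload_shard(idx: int, shard: bytes) -> str:
    # Round-robin multi-path placement across tiers (an assumed policy).
    tier = "host" if idx % 2 == 0 else "disk"
    with tier_sem[tier]:  # concurrency control per tier
        if tier == "host":
            host_cache[idx] = shard
        else:
            with open(os.path.join(disk_dir, f"shard_{idx}.bin"), "wb") as f:
                f.write(shard)
    return tier

def reload_shard(idx: int) -> bytes:
    # Cache-aware read: prefer the host-memory tier, fall back to disk.
    if idx in host_cache:
        return host_cache[idx]
    with open(os.path.join(disk_dir, f"shard_{idx}.bin"), "rb") as f:
        return f.read()

# A mock serialized optimizer-state blob, split into 8 shards.
state = bytes(range(256)) * 4
n = 8
size = len(state) // n
shards = [state[i * size:(i + 1) * size] for i in range(n)]

# Asynchronous offload: shards are written to both tiers in parallel.
with ThreadPoolExecutor(max_workers=n) as pool:
    list(pool.map(offload_shard, range(n), shards))

# On the next update phase, shards are gathered back and must round-trip intact.
restored = b"".join(reload_shard(i) for i in range(n))
assert restored == state
```

The key design point the sketch mirrors is that splitting shards across independent I/O paths lets their aggregate bandwidth be used concurrently, while the per-tier semaphores keep any single tier from being oversubscribed.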