MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall

πŸ“… 2025-09-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Training large language models (LLMs) with parameter counts exceeding aggregate multi-GPU memory capacity faces severe bottlenecks, as existing asynchronous multi-level offloading techniques incur substantial I/O overhead, significantly slowing iteration throughput. Method: We propose a multi-level concurrent-control offloading architecture that enables efficient caching and parallel transmission of optimizer states during both backward propagation and parameter update phases. Our approach integrates host-memory–disk collaborative offloading, cache-aware optimizer state management, and asynchronous I/O scheduling to fully utilize remote storage bandwidth and mitigate I/O contention. Contribution/Results: Evaluated on a 280B-parameter model, our system achieves a 2.5Γ— speedup in training iteration time over state-of-the-art systems. It effectively breaks the GPU memory barrier and substantially improves training efficiency under resource-constrained conditions.

πŸ“ Abstract
Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary because LLM sizes grow faster than GPU memory. To address this, state-of-the-art systems propose multi-tier host-memory and disk offloading techniques. Despite advanced asynchronous multi-tier read/write strategies, such offloading places significant I/O overheads on the critical path of training, resulting in slower iterations. We therefore propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed to optimize LLM training on resource-constrained setups by mitigating I/O bottlenecks. Several key observations drive the design of MLP-Offload: I/O overheads during the update phase dominate iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion, mitigating I/O bottlenecks during the backward and update phases. Evaluations on models up to 280B parameters show that MLP-Offload achieves 2.5× faster iterations compared to state-of-the-art LLM training runtimes.
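The core idea described above — striping optimizer-state shards across multiple storage tiers while capping concurrent transfers to avoid I/O contention — can be sketched as follows. This is a minimal illustrative sketch, not MLP-Offload's actual implementation: the class, method names, and the two in-memory dicts standing in for the host-memory and disk tiers are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class TieredOffloader:
    """Hypothetical sketch: multi-path offloading of optimizer-state
    shards across two tiers, with per-tier concurrency control."""

    def __init__(self, max_concurrent_per_tier=2):
        # Dicts simulate the host-memory and disk tiers.
        self.tiers = {"host": {}, "disk": {}}
        # Semaphores throttle concurrent writes per tier, loosely
        # mirroring the paper's concurrency-controlled offloading.
        self.sems = {t: threading.Semaphore(max_concurrent_per_tier)
                     for t in self.tiers}
        self.pool = ThreadPoolExecutor(max_workers=4)

    def _write(self, tier, key, data):
        with self.sems[tier]:           # limit in-flight I/O on this tier
            self.tiers[tier][key] = bytes(data)

    def offload(self, key, shard):
        # Multi-path: split the shard so both tiers transfer in
        # parallel, using both bandwidths instead of one.
        mid = len(shard) // 2
        f_host = self.pool.submit(self._write, "host", key, shard[:mid])
        f_disk = self.pool.submit(self._write, "disk", key, shard[mid:])
        return f_host, f_disk           # caller overlaps these with compute

    def fetch(self, key):
        # Reassemble the shard from both paths for the update phase.
        return self.tiers["host"][key] + self.tiers["disk"][key]
```

The futures returned by `offload` let a training loop overlap transfers with backward-pass compute and only block on them at the update step, which is the overlap the paper's asynchronous scheduling exploits.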
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM training on resource-constrained GPU setups
Mitigating I/O bottlenecks in multi-tier memory offloading
Reducing iteration time during backward and update phases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level multi-path offloading for LLM training
Cache-efficient concurrency-controlled optimizer state offloading
Mitigates I/O bottlenecks across multiple storage tiers
πŸ”Ž Similar Papers
No similar papers found.