AI Summary
Training large language models (LLMs) with parameter counts exceeding aggregate multi-GPU memory capacity faces severe bottlenecks: existing asynchronous multi-level offloading techniques incur substantial I/O overhead, significantly slowing iteration throughput. Method: We propose a multi-level, concurrency-controlled offloading architecture that enables efficient caching and parallel transmission of optimizer states during both the backward propagation and parameter update phases. Our approach integrates host-memory–disk collaborative offloading, cache-aware optimizer state management, and asynchronous I/O scheduling to fully utilize remote storage bandwidth and mitigate I/O contention. Contribution/Results: Evaluated on a 280B-parameter model, our system achieves a 2.5× speedup in training iteration time over state-of-the-art systems. It effectively breaks the GPU memory barrier and substantially improves training efficiency under resource-constrained conditions.
Abstract
Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary, as LLM sizes grow faster than GPU memory capacity. To this end, state-of-the-art systems propose multi-tier offloading to host memory or disk. Despite advanced asynchronous multi-tier read/write strategies, such offloading places significant I/O overheads on the critical path of training, slowing down iterations. To address this, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed to optimize LLM training on resource-constrained setups by mitigating I/O bottlenecks. We make several key observations that drive the design of MLP-Offload: I/O overheads during the update phase dominate the iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion, mitigating I/O bottlenecks during the backward and update phases. Evaluations on models of up to 280B parameters show that MLP-Offload achieves 2.5× faster iterations compared to state-of-the-art LLM training runtimes.
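To make the core idea concrete, here is a minimal sketch of multi-path, concurrency-controlled offloading of optimizer-state shards across two tiers (host memory and disk). All names and the round-robin tier-placement policy are illustrative assumptions for this sketch, not MLP-Offload's actual API or scheduling policy; the real system also targets a third remote storage tier and overlaps offloading with backward/update computation.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

# Tier 1: host memory (a dict stands in for a pinned host-memory cache).
host_cache = {}
# Tier 2: disk (a temp directory stands in for local/remote storage).
disk_dir = tempfile.mkdtemp()
# Per-tier semaphores bound concurrent writers, limiting I/O contention
# when many shards are flushed at once (illustrative limits).
tier_sem = {"host": threading.Semaphore(4), "disk": threading.Semaphore(2)}

def offload_shard(idx: int, shard: bytes) -> str:
    # Round-robin multi-path placement across tiers (an assumed policy).
    tier = "host" if idx % 2 == 0 else "disk"
    with tier_sem[tier]:  # concurrency control per tier
        if tier == "host":
            host_cache[idx] = shard
        else:
            with open(os.path.join(disk_dir, f"shard_{idx}.bin"), "wb") as f:
                f.write(shard)
    return tier

def reload_shard(idx: int) -> bytes:
    # Cache-aware read: prefer the host-memory tier, fall back to disk.
    if idx in host_cache:
        return host_cache[idx]
    with open(os.path.join(disk_dir, f"shard_{idx}.bin"), "rb") as f:
        return f.read()

# A mock serialized optimizer-state blob, split into 8 shards.
state = bytes(range(256)) * 4
n = 8
size = len(state) // n
shards = [state[i * size:(i + 1) * size] for i in range(n)]

# Asynchronous offload: shards are written to both tiers in parallel.
with ThreadPoolExecutor(max_workers=n) as pool:
    list(pool.map(offload_shard, range(n), shards))

# On the next update phase, shards are gathered back and must round-trip intact.
restored = b"".join(reload_shard(i) for i in range(n))
assert restored == state
```

The key design point the sketch mirrors is that splitting shards across independent I/O paths lets their aggregate bandwidth be used concurrently, while the per-tier semaphores keep any single tier from being oversubscribed.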