ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates

📅 2025-05-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the severe GPU compute stalls caused by memory-constrained LLM fine-tuning, this paper proposes ZenFlow, a stall-free offloading training framework. ZenFlow jointly exploits gradient importance and its spatiotemporal locality to update critical parameters in place on the GPU while asynchronously offloading and accumulating less important gradients in CPU memory. A lightweight dynamic selection mechanism avoids global synchronization overhead, and CPU-side parameter updates are pipelined with GPU computation and PCIe transfers. Compared to baselines such as ZeRO-Offload, ZenFlow achieves up to 5x end-to-end training speedup, halves PCIe traffic, cuts GPU stall time by over 85%, and preserves model accuracy. Its core contribution is the first integration of gradient-importance awareness with asynchronous offloading, enabling efficient, low-overhead, accuracy-preserving offloaded training.

📝 Abstract
Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy.
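The decoupled update scheme described in the abstract can be illustrated with a minimal serial sketch: split each gradient by magnitude into a top-k "important" set that is applied immediately (the GPU side) and a remainder that is accumulated in a separate buffer and applied periodically (the CPU side). All function names and the plain-SGD update rule here are illustrative assumptions, not ZenFlow's actual API, and the real system runs the two halves asynchronously across GPU and CPU with transfer pipelining, which this single-process sketch omits.

```python
# Illustrative sketch of importance-partitioned updates (not ZenFlow's real API).
import numpy as np

def split_by_importance(grad, topk_fraction=0.1):
    """Mark the top-k largest-magnitude gradient entries as 'important'."""
    k = max(1, int(grad.size * topk_fraction))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    mask = np.zeros(grad.size, dtype=bool)
    mask[idx] = True
    return mask

def train_step(params, grad, cpu_accum, mask, lr=0.01):
    """Apply important gradients in place (the 'GPU' path); bank the rest."""
    params[mask] -= lr * grad[mask]       # immediate update of critical coords
    cpu_accum[~mask] += grad[~mask]       # accumulate less-important gradients
    return params, cpu_accum

def flush_accumulated(params, cpu_accum, lr=0.01):
    """Periodically apply the accumulated low-importance updates (the 'CPU' path)."""
    params -= lr * cpu_accum
    cpu_accum[:] = 0.0
    return params
```

In the real system the flush would happen on the CPU, overlapped with subsequent GPU forward/backward passes, so the GPU never waits for it; the sketch only shows the partitioning logic.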
Problem

Research questions and friction points this paper is trying to address.

Overcoming GPU memory limits during LLM fine-tuning via offloading
Reducing GPU stalls caused by slow CPU updates and PCIe transfers
Prioritizing parameter updates to optimize GPU-CPU workload overlap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritizes important parameters for updates
Asynchronously offloads less important gradients
Lightweight gradient selection avoids global synchronization
Tingfeng Lan
Department of Computer Science, University of Virginia
ML systems
Yusen Wu
University of Virginia
Bin Ma
University of California, Merced
Zhaoyuan Su
University of Virginia
Rui Yang
University of Virginia
Tekin Bicer
Computer Scientist, Data Science and Learning Division, Argonne National Laboratory
Parallel and Distributed Systems · Data-Intensive Computing · Big Data · Cloud Computing · High-Performance Computing
Dong Li
University of California, Merced
Yue Cheng
University of Virginia