GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping

📅 2025-12-19
🤖 AI Summary
To address low small-batch throughput and severe I/O bottlenecks in large language model (LLM) training with SSD offloading, this paper proposes an efficient training system based on vertical scheduling. The method introduces a novel *vertical micro-batch scheduling* paradigm—the first to enable fine-grained overlap between optimizer updates and the next iteration's forward pass—integrated with gradient accumulation, heterogeneous SSD-based storage offloading, compute-I/O overlapping, and full compatibility with ZeRO-Infinity. This design brings performance close to the roofline model's theoretical bound. Experiments on a single A100 GPU show that the system achieves 1.93–2.53× higher throughput than ZeRO-Infinity for GPT-65B and GPT-175B training, significantly improving per-GPU saturation throughput. The approach establishes a new paradigm for cost-effective, high-efficiency LLM training.

📝 Abstract
SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all micro-batches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake.
Problem

Research questions and friction points this paper is trying to address.

Accelerates SSD-offloaded LLM training via vertical scheduling
Overlaps optimizer steps with forward passes to reduce I/O bottlenecks
Improves training throughput for large models like GPT-65B/175B
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vertical scheduling for micro-batch execution
Overlaps optimizer step with next forward pass
Achieves higher throughput with smaller batch sizes
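The scheduling contrast described above can be sketched in a few lines. This is an illustrative toy model (not the authors' code): it counts how often a layer's weights must be fetched from SSD under each schedule, which is the I/O cost vertical scheduling amortizes across micro-batches.

```python
# Toy sketch of horizontal vs. vertical micro-batch scheduling in an
# SSD-offloaded forward pass. Layer weights must be resident on the GPU
# before compute, so each visit to a layer triggers one SSD fetch.

NUM_LAYERS = 3
NUM_MICRO_BATCHES = 4

def horizontal_schedule():
    """Existing systems: run each micro-batch end to end through all layers."""
    loads = 0
    for _mb in range(NUM_MICRO_BATCHES):
        for _layer in range(NUM_LAYERS):
            loads += 1  # layer refetched from SSD for every micro-batch
    return loads

def vertical_schedule():
    """GreedySnake: run all micro-batches of one layer before the next layer."""
    loads = 0
    for _layer in range(NUM_LAYERS):
        loads += 1  # layer fetched from SSD once per iteration
        for _mb in range(NUM_MICRO_BATCHES):
            pass    # compute micro-batch on the already-resident layer
    return loads

print(horizontal_schedule())  # 12 layer fetches
print(vertical_schedule())    # 3 layer fetches
```

Under this model, vertical scheduling cuts layer I/O by the gradient-accumulation factor, which is why the paper reports higher throughput at smaller batch sizes.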
Yikang Yue
University of Illinois at Urbana-Champaign
Yishu Yin
Tsinghua University
Xuehai Qian
Tsinghua University
Computer Architecture · Computer Systems