GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping

📅 2025-12-19
🤖 AI Summary
To address low small-batch throughput and severe I/O bottlenecks in large language model (LLM) training with SSD offloading, this paper proposes an efficient training system based on vertical scheduling. The method introduces a novel *vertical micro-batch scheduling* paradigm—the first to enable fine-grained overlap between optimizer updates and the next iteration's forward pass—integrated with gradient accumulation, heterogeneous SSD-based storage offloading, compute-I/O overlapping, and full compatibility with ZeRO-Infinity. This design brings performance close to the roofline model's theoretical bound. Experiments on a single A100 GPU show that the system achieves 1.93–2.53× higher throughput than ZeRO-Infinity for GPT-65B and GPT-175B training, significantly improving per-GPU saturation throughput. The approach establishes a new paradigm for cost-effective, high-efficiency LLM training.

📝 Abstract
SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all micro-batches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake.
Problem

Research questions and friction points this paper is trying to address.

Accelerates SSD-offloaded LLM training via vertical scheduling
Overlaps optimizer steps with forward passes to reduce I/O bottlenecks
Improves training throughput for large models like GPT-65B/175B
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vertical scheduling for micro-batch execution
Overlaps optimizer step with next forward pass
Achieves higher throughput with smaller batch sizes
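The scheduling contrast described above can be sketched in a few lines. This is an illustrative toy model (not the authors' code): it counts how often a layer's weights must be fetched from SSD under each schedule, which is the I/O cost vertical scheduling amortizes across micro-batches.

```python
# Toy sketch of horizontal vs. vertical micro-batch scheduling in an
# SSD-offloaded forward pass. Layer weights must be resident on the GPU
# before compute, so each visit to a layer triggers one SSD fetch.

NUM_LAYERS = 3
NUM_MICRO_BATCHES = 4

def horizontal_schedule():
    """Existing systems: run each micro-batch end to end through all layers."""
    loads = 0
    for _mb in range(NUM_MICRO_BATCHES):
        for _layer in range(NUM_LAYERS):
            loads += 1  # layer refetched from SSD for every micro-batch
    return loads

def vertical_schedule():
    """GreedySnake: run all micro-batches of one layer before the next layer."""
    loads = 0
    for _layer in range(NUM_LAYERS):
        loads += 1  # layer fetched from SSD once per iteration
        for _mb in range(NUM_MICRO_BATCHES):
            pass    # compute micro-batch on the already-resident layer
    return loads

print(horizontal_schedule())  # 12 layer fetches
print(vertical_schedule())    # 3 layer fetches
```

Under this model, vertical scheduling cuts layer I/O by the gradient-accumulation factor, which is why the paper reports higher throughput at smaller batch sizes.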
Yikang Yue
University of Illinois at Urbana-Champaign
Yishu Yin
Tsinghua University
Xuehai Qian
Tsinghua University
Computer Architecture · Computer Systems