🤖 AI Summary
Training ultra-large language models (100B+ parameters) in full precision on a single GPU is constrained by limited device memory and CPU–GPU bandwidth. This work proposes a memory-centric training architecture that keeps model parameters and optimizer states in host memory, treating the GPU as a transient compute unit. By streaming layer parameters with a double-buffered pipeline and employing a stateless automatic differentiation mechanism, the approach eliminates persistent device state and overcomes the bandwidth bottleneck. The method successfully trains a 120B-parameter model on a single H200 GPU and achieves 1.84× higher throughput than DeepSpeed ZeRO-3 with CPU offloading for a 14B-parameter model. It also enables efficient training of a 7B-parameter model with a 512k-token context length on a single GH200 system.
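The "stateless" idea above can be illustrated with a toy layer template: the template holds only the layer's structure (its forward rule), while the weights live in host memory and are bound transiently at call time. This is a minimal sketch, not MegaTrain's API; `linear_template`, `host_w`, and `host_b` are hypothetical names.

```python
# A layer "template" holds only structure (the forward rule), not weights;
# parameters are bound per call as they stream in, so the device keeps no
# persistent per-layer state. Purely illustrative; names are hypothetical.
def linear_template(weights, bias, x):
    # Pure function of the streamed-in parameters and the activations.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Parameters live in host memory and are bound transiently at call time.
host_w = [[1.0, 0.0], [0.0, 2.0]]
host_b = [0.5, -0.5]
out = linear_template(host_w, host_b, [3.0, 4.0])
print(out)  # [3.5, 7.5]
```

Because the template is a pure function, the same code object can serve every instance of a layer type, with no per-layer autograd graph retained on the device.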
📝 Abstract
We present MegaTrain, a memory-centric system that efficiently trains 100B+-parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host (CPU) memory and treats the GPU as a transient compute engine. For each layer, we stream parameters in and gradients out, minimizing persistent device state. To mitigate the CPU–GPU bandwidth bottleneck, we adopt two key optimizations. (1) We introduce a pipelined, double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, keeping the GPU continuously busy. (2) We replace persistent autograd graphs with stateless layer templates that bind weights dynamically as they stream in, eliminating persistent graph metadata while allowing flexible scheduling. On a single H200 GPU with 1.5TB of host memory, MegaTrain reliably trains models of up to 120B parameters, and it achieves 1.84$\times$ the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables training a 7B model with a 512k-token context on a single GH200.
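The double-buffered streaming loop can be sketched as follows, with a background thread standing in for an asynchronous prefetch stream: while layer *i* computes, layer *i+1*'s parameters are already being copied in. This is a minimal illustration under stated assumptions; `fetch_weights` and `apply_layer` are hypothetical stand-ins, and a real implementation would use CUDA streams with pinned host buffers rather than Python threads.

```python
# Minimal sketch of double-buffered layer streaming. Per-layer weights live in
# a host-resident list; a worker thread plays the role of a CUDA prefetch
# stream. All names here are illustrative, not MegaTrain's API.
from concurrent.futures import ThreadPoolExecutor

HOST_WEIGHTS = [[float(i)] * 4 for i in range(3)]  # parameters stay on the "host"

def fetch_weights(layer_idx):
    # Stands in for an async host-to-device copy on the prefetch stream.
    return list(HOST_WEIGHTS[layer_idx])

def apply_layer(weights, activations):
    # Stands in for the layer's forward compute on the default stream.
    return [a + w for a, w in zip(activations, weights)]

def forward(activations):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_weights, 0)  # prefetch layer 0
        for i in range(len(HOST_WEIGHTS)):
            weights = pending.result()  # wait for the current layer's copy
            if i + 1 < len(HOST_WEIGHTS):
                # Overlap the next layer's copy with the current compute.
                pending = pool.submit(fetch_weights, i + 1)
            activations = apply_layer(weights, activations)
    return activations

result = forward([0.0, 0.0, 0.0, 0.0])
print(result)
```

With two in-flight buffers (the weights being used and the weights being fetched), the copy latency of each layer hides behind the compute of the previous one; the symmetric trick offloads gradients on a third stream during the backward pass.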