🤖 AI Summary
To address the memory constraints and low inter-GPU communication bandwidth of consumer-grade GPUs (e.g., a single 16-GB GPU or 4×RTX 4090), this work introduces LLMQ, an end-to-end CUDA/C++ implementation of an 8-bit training framework for efficient pretraining and fine-tuning of medium-scale language models (3B–32B parameters). Methodologically, it integrates activation checkpointing, GPU memory offloading, copy-engine-accelerated collective communication, mixed-precision arithmetic, and memory-aware scheduling, enabling a standard 8-bit training pipeline without additional algorithmic approximations. Experiments demonstrate that the framework trains a 7B model on a single GPU and scales to a 32B model on four GPUs while achieving roughly 50% FLOP utilization. Its training efficiency rivals that of production-scale systems running on much more expensive cloud-grade GPUs, substantially lowering the barrier to large-model pretraining on accessible hardware.
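To make the offloading idea concrete, here is a minimal CUDA sketch, assuming a dedicated copy stream so that the GPU's copy engine moves an activation buffer to pinned host memory while compute continues on another stream; all names and sizes are illustrative and not taken from LLMQ.

```cpp
// Hypothetical sketch (not LLMQ's actual code): offload an activation buffer to
// pinned host memory on a dedicated stream so the GPU's copy engine overlaps the
// transfer with ongoing compute, then fetch it back before the backward pass.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64 << 20;            // 64 MiB activation buffer (illustrative)
    float *d_act = nullptr, *h_act = nullptr;
    cudaMalloc((void**)&d_act, bytes);
    cudaMallocHost((void**)&h_act, bytes);    // pinned host memory enables async DMA

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);                  // separate stream -> copy engine can overlap compute

    cudaEvent_t produced;
    cudaEventCreate(&produced);

    // ... forward kernels writing d_act would be launched on `compute` here ...
    cudaEventRecord(produced, compute);

    // Offload: device -> host, overlapping with any later kernels on `compute`.
    cudaStreamWaitEvent(copy, produced, 0);
    cudaMemcpyAsync(h_act, d_act, bytes, cudaMemcpyDeviceToHost, copy);

    // Before the backward pass, bring the activation back (host -> device).
    cudaMemcpyAsync(d_act, h_act, bytes, cudaMemcpyHostToDevice, copy);

    cudaStreamSynchronize(copy);
    printf("offload round trip done\n");

    cudaEventDestroy(produced);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFreeHost(h_act);
    cudaFree(d_act);
    return 0;
}
```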
📝 Abstract
We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g., 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory capacity and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine-based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with four RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and while maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems on much more expensive cloud-grade GPUs.
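The copy-engine-based collectives can be pictured as staging inter-GPU traffic through pinned host memory with asynchronous memcpys, so the transfers run on the GPUs' copy engines rather than their compute units. The sketch below shows a host-staged all-gather in plain CUDA under that assumption; it is an illustration of the general idea, not LLMQ's implementation, and `host_staged_all_gather` is a hypothetical helper.

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// All-gather of per-GPU shards staged through one pinned host buffer.
// All traffic is carried by async memcpys (copy engines), not compute kernels.
void host_staged_all_gather(const std::vector<float*>& d_shard,   // per-GPU input shard
                            const std::vector<float*>& d_out,     // per-GPU gathered output
                            float* h_stage,                       // pinned host staging buffer
                            size_t shard_elems,
                            const std::vector<cudaStream_t>& stream) {
    const int n = (int)d_shard.size();
    const size_t shard_bytes = shard_elems * sizeof(float);

    // Phase 1: each GPU uploads its shard into its slot of the host buffer.
    for (int g = 0; g < n; ++g) {
        cudaSetDevice(g);
        cudaMemcpyAsync(h_stage + (size_t)g * shard_elems, d_shard[g], shard_bytes,
                        cudaMemcpyDeviceToHost, stream[g]);
    }
    for (int g = 0; g < n; ++g) { cudaSetDevice(g); cudaStreamSynchronize(stream[g]); }

    // Phase 2: each GPU downloads the fully gathered buffer.
    for (int g = 0; g < n; ++g) {
        cudaSetDevice(g);
        cudaMemcpyAsync(d_out[g], h_stage, (size_t)n * shard_bytes,
                        cudaMemcpyHostToDevice, stream[g]);
    }
    for (int g = 0; g < n; ++g) { cudaSetDevice(g); cudaStreamSynchronize(stream[g]); }
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    const size_t shard_elems = 1 << 20;       // 1M floats per GPU (illustrative)

    std::vector<float*> d_shard(n), d_out(n);
    std::vector<cudaStream_t> stream(n);
    float* h_stage = nullptr;
    cudaMallocHost((void**)&h_stage, (size_t)n * shard_elems * sizeof(float));  // pinned

    for (int g = 0; g < n; ++g) {
        cudaSetDevice(g);
        cudaMalloc((void**)&d_shard[g], shard_elems * sizeof(float));
        cudaMalloc((void**)&d_out[g], (size_t)n * shard_elems * sizeof(float));
        cudaStreamCreate(&stream[g]);
    }

    host_staged_all_gather(d_shard, d_out, h_stage, shard_elems, stream);
    printf("all-gather across %d GPUs complete\n", n);

    for (int g = 0; g < n; ++g) {
        cudaSetDevice(g);
        cudaFree(d_shard[g]);
        cudaFree(d_out[g]);
        cudaStreamDestroy(stream[g]);
    }
    cudaFreeHost(h_stage);
    return 0;
}
```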