Horizon-LM: A RAM-Centric Architecture for LLM Training

📅 2026-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the GPU memory bottleneck in post-training large language models (e.g., instruction tuning and alignment), which hinders efficient single-node execution. The authors propose a host-memory-centric training architecture that fundamentally departs from the conventional GPU-centric paradigm by designating the CPU as the authoritative parameter store and treating GPUs solely as transient compute units. By integrating explicit recomputation, manual gradient propagation, and a pipelined double-buffering mechanism, the approach decouples model scale from GPU count and keeps memory consumption at the theoretical lower bound dictated by the model parameters. Experiments demonstrate successful training of a 120B-parameter model on a single H200 GPU with 1.5TB host memory, and 12.2× higher throughput than DeepSpeed ZeRO-3 with CPU offloading on an A100, while preserving numerical correctness.
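The CPU-master, double-buffered idea described above can be illustrated with a minimal sketch: host memory holds the authoritative per-layer weights, and two alternating "device" slots let the upload of layer i+1 overlap with compute on layer i. This is not the paper's code; `HostStore`, `fetch`, and the scalar "layers" are illustrative stand-ins (a real system would use pinned host buffers and asynchronous device copies).

```python
# Hedged sketch: CPU-resident parameter store + double-buffered layer streaming.
# A single background worker plays the role of the copy engine, so fetching the
# next layer overlaps with computing the current one.
from concurrent.futures import ThreadPoolExecutor

class HostStore:
    """Authoritative parameter store in host RAM (here: a plain list)."""
    def __init__(self, layers):
        self.layers = layers              # one scalar weight per "layer"

    def fetch(self, i):
        return self.layers[i]             # stands in for a host->device copy

def forward_streamed(store, x):
    """Run x through all layers, prefetching each next layer's weights."""
    n = len(store.layers)
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(store.fetch, 0)            # prefetch layer 0
        for i in range(n):
            w = pending.result()                           # wait for weights
            if i + 1 < n:
                pending = copier.submit(store.fetch, i + 1)  # overlap next copy
            x = w * x                                      # placeholder compute
    return x
```

Because only the two in-flight layers ever occupy "device" memory, the model's total size is bounded by host RAM rather than by the accelerator, which is the scaling property the summary describes.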

📝 Abstract
The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5 TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2× higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
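The abstract's "explicit recomputation with manual gradient propagation" can be sketched on toy scalar layers y = w·x: only the model input is checkpointed, each layer's input is recomputed during the backward pass, and gradients are chained by hand instead of through an autograd graph. All names and the quadratic replay strategy here are illustrative assumptions, not Horizon-LM's implementation.

```python
# Hedged sketch: backward pass without an autograd graph. The model input x0
# is the only saved activation; layer inputs are recomputed on demand.

def forward(ws, x0):
    """Forward through scalar layers y = w * x, storing no activations."""
    x = x0
    for w in ws:
        x = w * x
    return x

def backward(ws, x0, grad_out):
    """Manually propagate grad_out layer by layer, recomputing inputs."""
    grads = [0.0] * len(ws)
    g = grad_out
    for i in reversed(range(len(ws))):
        # Explicit recomputation: rebuild layer i's input from the checkpoint.
        xi = x0
        for w in ws[:i]:
            xi = w * xi
        grads[i] = g * xi        # dL/dw_i = upstream grad * layer input
        g = g * ws[i]            # dL/dx_i = upstream grad * layer weight
    return grads
```

For ws = [2, 3] and x0 = 5 the loss is w0·w1·x0, so the hand-derived gradients are dL/dw0 = w1·x0 = 15 and dL/dw1 = w0·x0 = 10, which the sketch reproduces; trading this recomputation for freed activation memory is the same bargain the system makes at scale.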
Problem

Research questions and friction points this paper is trying to address.

large language models
memory bottleneck
single-node training
GPU memory constraint
post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

RAM-centric architecture
CPU-master GPU-template execution
explicit recomputation
pipelined double-buffered execution
memory-bounded LLM training
🔎 Similar Papers
No similar papers found.