Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

📅 2026-05-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the prohibitive system overhead of second-order optimization methods in large language model training, which stems primarily from their massive optimizer states. The authors propose Asteria, a runtime system that decouples second-order optimization logic from the GPU critical path. Asteria orchestrates optimizer state management, background computation, and distributed synchronization through dynamic state partitioning, asynchronous inverse-root computation, a bounded-staleness synchronization protocol, and topology-aware coordination. Unlike prior approaches that only simplify the optimizer, it manages these costs at the runtime level, which enables second-order training of a 1B-parameter model on a single GB10 GPU. On multi-node GH200 systems, Asteria lowers visible optimizer overhead, reduces recurring latency spikes, and accelerates wall-clock convergence for a 7B-parameter model while preserving the optimization advantages of SOAP and KL-Shampoo.
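The dynamic state partitioning mentioned in the summary can be pictured with a small sketch. The code below is not Asteria's policy, which this page does not specify. It is a hypothetical greedy placement: the names `StateBlock`, `access_freq`, and `place_blocks` are assumptions made for illustration, and the rule (hottest blocks go to the fastest tier that still has room) is just one plausible heuristic.

```python
from dataclasses import dataclass


@dataclass
class StateBlock:
    """One chunk of optimizer state (hypothetical structure)."""
    name: str
    size_bytes: int
    access_freq: float  # assumed metric: how often the block is touched per step


def place_blocks(blocks, capacities):
    """Greedily assign blocks to tiers, fastest tier first.

    `capacities` maps a tier name ("gpu", "cpu", "nvme") to free bytes.
    A tier missing from `capacities` is treated as unavailable, which
    models the optional NVMe tier.
    """
    tiers = ["gpu", "cpu", "nvme"]  # ordered from fastest to slowest
    free = dict(capacities)         # copy so the caller's dict is not mutated
    placement = {}
    # Most frequently accessed blocks get first pick of the fast tiers.
    for block in sorted(blocks, key=lambda b: b.access_freq, reverse=True):
        for tier in tiers:
            if free.get(tier, 0) >= block.size_bytes:
                free[tier] -= block.size_bytes
                placement[block.name] = tier
                break
        else:
            # No tier had enough room for this block.
            raise MemoryError(f"no tier can hold block {block.name!r}")
    return placement
```

A real runtime would presumably re-run such a policy as memory pressure changes during training; this sketch only shows a single static pass.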
📝 Abstract
Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce Asteria, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtime pressure. It further uses training hooks to prepare shadow states in advance, allowing expensive inverse-root computations to proceed asynchronously on the host while GPU computation continues. For distributed training, Asteria employs a bounded-staleness protocol that limits synchronization frequency while preserving optimizer effectiveness through topology-aware coordination. We evaluate Asteria on both memory-constrained and distributed training settings. On a DGX Spark platform with a single GB10 GPU and 128 GB unified memory, Asteria supports second-order training for a 1B-parameter language model. On multi-node GH200 systems, it lowers visible optimizer overhead, reduces recurring latency spikes, accelerates convergence in wall-clock time, and maintains the optimization advantages of SOAP and KL-Shampoo in a 7B-parameter language model. Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.
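To make the asynchronous inverse-root idea concrete, here is a minimal sketch under stated assumptions. It is not Asteria's implementation. The class `AsyncPreconditioner` and the parameter `max_staleness` are hypothetical names, and a diagonal statistic with an inverse fourth root stands in for the matrix preconditioners that methods such as SOAP and KL-Shampoo maintain. What it illustrates is the control flow: a snapshot ("shadow state") is handed to a background worker, the training path keeps using the last finished result, and it blocks only when that result is older than a fixed bound.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncPreconditioner:
    """Background inverse-root updates with bounded staleness (illustrative)."""

    def __init__(self, dim, max_staleness=4, eps=1e-8):
        self.stats = [0.0] * dim     # running sum of squared gradients (diagonal stand-in)
        self.inv_root = [1.0] * dim  # preconditioner used by the training path
        self.version_step = 0        # step whose statistics produced inv_root
        self.max_staleness = max_staleness
        self.eps = eps
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = None         # in-flight background job, if any

    def _compute(self, snapshot, step):
        # Runs on the worker thread and touches only its private snapshot.
        return [(s + self.eps) ** -0.25 for s in snapshot], step

    def step(self, grad, step):
        """Precondition `grad`; `step` is assumed to count up from 1."""
        self.stats = [s + g * g for s, g in zip(self.stats, grad)]
        # Adopt a finished background result without waiting for it.
        if self._pending is not None and self._pending.done():
            self.inv_root, self.version_step = self._pending.result()
            self._pending = None
        # Hand a snapshot of the current statistics to the worker.
        if self._pending is None:
            self._pending = self._pool.submit(self._compute, list(self.stats), step)
        # Bounded staleness: block only when the active result is too old.
        if step - self.version_step > self.max_staleness:
            self.inv_root, self.version_step = self._pending.result()
            self._pending = None
        return [g * p for g, p in zip(grad, self.inv_root)]

    def close(self):
        self._pool.shutdown(wait=True)
```

In a training loop one would call `step(grad, t)` once per iteration with `t` starting at 1 and call `close()` at the end. Which preconditioner version is active at a given step depends on thread timing, but the gap `t - version_step` never exceeds `max_staleness`.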
Problem

Research questions and friction points this paper is trying to address.

second-order optimization
LLM training
optimizer state
systems cost
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

second-order optimization
runtime system
optimizer state management
asynchronous computation
distributed training