🤖 AI Summary
This work addresses the severe cold-start latency (tens of seconds to minutes) that large language models incur during autoscaling and dynamic parallelism-strategy switching, caused by CUDA graph reinitialization. The authors propose a template-based CUDA graph context materialization method that, for the first time, enables graph serialization and reconstruction compatible with dynamic parallelism strategies. By persisting graph topology and execution context offline, enforcing deterministic memory layouts, automatically extracting kernel binaries, and deploying single-GPU templates across multiple GPUs with rank-specific communication-state patching, the approach drastically reduces cold-start overhead. Evaluated on dense and Mixture-of-Experts models of up to 235B parameters, it achieves up to a 99% latency reduction (e.g., cutting Qwen3-235B-A22B's initialization time from 10 minutes to 3.9 seconds) while preserving the throughput benefits of CUDA graphs.
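To make the core idea concrete, here is a minimal sketch in pure Python (no GPU, and not Foundry's actual API; all names are illustrative). It models why naive serialization fails, namely that kernel arguments embed raw device addresses, and how a template sidesteps this: offline, kernel arguments are recorded as offsets relative to a deterministic memory arena; online, they are rebound against a freshly allocated arena.

```python
# Hypothetical sketch of template-based graph materialization.
# Offline: persist topology plus kernel args as (offset, size) pairs
# relative to an arena base instead of raw device pointers.
# Online: rebind absolute addresses against a new arena.
import json

ARENA_BASE = 0x7F0000000000  # deterministic base enforced by the allocator


def capture_template(nodes):
    """Offline stage: serialize graph topology and relocatable kernel args."""
    return json.dumps({
        "nodes": [
            {"kernel": n["kernel"],
             "deps": n["deps"],
             "args": [{"offset": addr - ARENA_BASE, "size": size}
                      for addr, size in n["args"]]}
            for n in nodes
        ]
    })


def materialize(template, new_base):
    """Online stage: turn offsets back into absolute addresses.
    With a deterministic layout, new_base equals ARENA_BASE and the
    patching loop degenerates into a near no-op."""
    graph = json.loads(template)
    for node in graph["nodes"]:
        for arg in node["args"]:
            arg["addr"] = new_base + arg.pop("offset")
    return graph


# Offline capture (e.g., during an offline processing stage) ...
tpl = capture_template([
    {"kernel": "rmsnorm", "deps": [],  "args": [(ARENA_BASE + 0x0, 4096)]},
    {"kernel": "gemm",    "deps": [0], "args": [(ARENA_BASE + 0x1000, 8192)]},
])
# ... online reconstruction after a cold start.
g = materialize(tpl, ARENA_BASE)
print(hex(g["nodes"][1]["args"][0]["addr"] - ARENA_BASE))  # 0x1000
```

A real system must additionally reload the kernel binaries the graph references (lazily loaded during warmup in normal operation), which is why the paper pairs this relocation scheme with automatic kernel-binary extraction.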
📝 Abstract
Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent systems have reduced model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup. Unfortunately, CUDA graphs cannot be naively serialized: beyond graph topology, they are tightly coupled to execution context, including device addresses embedded in kernel arguments and kernel code lazily loaded during warmup. Existing approaches rely either on brittle kernel-specific patching or on heavyweight process-level checkpoint/restore, both of which are inflexible under dynamic parallelism switching. We present Foundry, a template-based CUDA graph context materialization system that persists both graph topology and execution context during an offline processing stage, and reconstructs executable graphs online with negligible overhead. Foundry enforces deterministic memory layouts, automatically extracts and reloads kernel binaries required by captured graphs, and reduces online reconstruction costs through topology-based templating. For distributed serving, Foundry further enables a single-GPU offline capture to generate templates for multi-GPU deployments by patching only rank-dependent communication state. Across dense and MoE models up to 235B parameters, Foundry reduces cold-start latency by up to 99%, cutting the initialization time of Qwen3-235B-A22B from 10 minutes to 3.9 seconds while preserving the throughput gains of CUDA graphs.
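The single-capture-to-many-ranks step can be sketched as follows, again in pure Python with invented names rather than Foundry's real data structures. The point it illustrates is that only communication state (rank id, world size, peer buffer addresses) differs between ranks, so one template can be deep-copied and patched per rank while compute nodes are reused untouched.

```python
# Hypothetical sketch: specialize one single-GPU graph template for each
# rank of a multi-GPU deployment by rewriting only rank-dependent
# communication state. Field names are illustrative.
import copy

# Template captured once on a single GPU: one compute node, one
# collective node whose communication state is a placeholder.
template = {
    "nodes": [
        {"kernel": "attention", "comm": None},
        {"kernel": "all_reduce",
         "comm": {"rank": 0, "world_size": 1, "peer_bufs": []}},
    ],
}


def specialize(template, rank, world_size, peer_bufs):
    """Copy the shared template and patch only communication state."""
    g = copy.deepcopy(template)
    for node in g["nodes"]:
        if node["comm"] is not None:  # compute nodes pass through unchanged
            node["comm"] = {"rank": rank,
                            "world_size": world_size,
                            "peer_bufs": peer_bufs}
    return g


world = 4
# Deterministic layouts make peer buffer addresses predictable per rank.
peer_bufs = [0x7F0000000000 + r * 0x100000 for r in range(world)]
graphs = [specialize(template, r, world, peer_bufs) for r in range(world)]
print(graphs[2]["nodes"][1]["comm"]["rank"])  # 2
```

In a real deployment the patched state would correspond to communicator handles and peer device pointers; the sketch only shows the structural idea that rank specialization touches a small, well-defined subset of the captured context.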