Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates and quantifies the critical impact of CPU bottlenecks on large language model (LLM) inference performance in multi-GPU settings, where insufficient CPU resources often lead to suboptimal GPU utilization and elevated latency—even when GPUs are not fully saturated. Through comprehensive profiling of real-world LLM serving workloads, the work identifies key CPU-induced inefficiencies, including kernel launch delays, communication stalls, and tokenization overhead. Experimental results demonstrate that moderately increasing CPU core count—without adding more GPUs—can reduce time-to-first-token latency by 1.36× to 5.40× while substantially improving system stability and responsiveness. These findings underscore the necessity of balanced CPU-GPU provisioning for efficient LLM inference deployment.

📝 Abstract
Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving workloads, we find that multi-GPU performance frequently degrades not because GPUs are saturated, but because CPUs fail to keep the GPUs busy. Under limited CPU allocations, systems exhibit symptoms such as delayed kernel launch, stalled communication, and increased tokenization latency, leading to severe GPU underutilization even when ample GPU resources are available. This work presents a systematic analysis of CPU-induced slowdowns in multi-GPU LLM inference. We show that these bottlenecks persist even in serving stacks that employ process-level separation and modern GPU-side optimizations such as CUDA Graphs. Since the marginal cost of additional CPU cores is small relative to GPU instance pricing, our evaluation indicates that increasing the number of CPU cores can substantially improve performance and stability at minimal additional cost. Under moderate serving load, we observe that CPU-starved configurations frequently time out, while providing adequate CPU resources restores responsiveness and reduces time-to-first-token (TTFT) latency by 1.36-5.40x across configurations, all without requiring additional GPUs. This work shows that CPU provisioning is a crucial factor in multi-GPU LLM inference configuration, helping prevent control-side bottlenecks.
Problem

Research questions and friction points this paper is trying to address.

CPU bottleneck
multi-GPU LLM inference
GPU underutilization
tokenization latency
time-to-first-token
Innovation

Methods, ideas, or system contributions that make the work stand out.

CPU-induced slowdown
multi-GPU LLM inference
GPU underutilization
time-to-first-token (TTFT)
CUDA Graphs
Euijun Chung
Ph.D. student at Georgia Institute of Technology
Computer Architecture
Yuxiao Jia
Georgia Institute of Technology
Aaron Jezghani
Georgia Institute of Technology
Hyesoon Kim
Georgia Tech
Computer Architecture, Compiler