🤖 AI Summary
This work addresses performance bottlenecks and dynamic behavior in LLM inference on CPU-GPU coupled architectures, comparing loosely coupled PCIe-based A100/H100 systems against the tightly coupled GH200 (Grace Hopper). The authors propose SKIP, a fine-grained operator-to-kernel profiling framework, and introduce Total Kernel Launch and Queuing Time (TKLQT), a novel metric that quantifies end-to-end kernel submission and queuing overhead. TKLQT reveals that GH200 remains CPU-bound up to 4x larger batch sizes than the PCIe-based systems, with Grace CPU performance driving higher inference latency at low batch sizes. Building on this insight, the authors apply kernel fusion tailored to the low-batch regime. Experimental results show: (i) GH200 achieves 1.9×–2.7× faster prefill latency than PCIe-based systems at large batch sizes (Llama 3.2-1B); (ii) TKLQT accurately identifies the CPU-bound/GPU-bound transition point; and (iii) kernel fusion significantly alleviates GH200's low-batch latency bottleneck by reducing kernel launch overhead.
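To make the TKLQT idea concrete, here is a minimal sketch of how launch and queuing overhead could be aggregated from profiler trace events. This is an illustration of the metric's definition as described above, not SKIP's actual API; the event fields and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class KernelEvent:
    """Hypothetical trace record for one kernel (all times in microseconds)."""
    launch_start: float  # CPU enters the launch API (e.g., cudaLaunchKernel)
    launch_end: float    # CPU returns from the launch API
    exec_start: float    # GPU actually begins executing the kernel

def tklqt(events: list[KernelEvent]) -> float:
    """Total Kernel Launch and Queuing Time: sum, over all kernels, of
    CPU-side launch overhead plus time spent waiting in the stream queue."""
    total = 0.0
    for e in events:
        launch_overhead = e.launch_end - e.launch_start
        queue_wait = max(0.0, e.exec_start - e.launch_end)
        total += launch_overhead + queue_wait
    return total

# Two kernels: the first spends 2us launching and 3us queued,
# the second 1us launching and 2us queued -> TKLQT = 8us.
events = [KernelEvent(0.0, 2.0, 5.0), KernelEvent(2.0, 3.0, 5.0)]
print(tklqt(events))  # 8.0
```

When kernels are short and launches are frequent (the low-batch regime), this sum grows relative to GPU execution time, which is how a CPU-bound phase shows up in the metric.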
📝 Abstract
Large language model (LLM) inference workloads increasingly dominate data center costs and resource utilization. Therefore, understanding inference workload characteristics on evolving CPU-GPU coupled architectures is crucial for optimization. This paper presents an in-depth analysis of LLM inference behavior on loosely-coupled (LC; PCIe-based A100/H100) and closely-coupled (CC; GH200) systems. We analyze performance dynamics using fine-grained operator-to-kernel trace analysis, facilitated by our novel profiler SKIP and metrics like Total Kernel Launch and Queuing Time (TKLQT). Results show that the CC GH200 significantly outperforms LC systems at large batch sizes, achieving 1.9x-2.7x faster prefill latency for Llama 3.2-1B. However, our analysis also reveals that GH200 remains CPU-bound up to 4x larger batch sizes than LC systems. In this extended CPU-bound region, we identify the performance characteristics of the Grace CPU as a key factor contributing to higher inference latency at low batch sizes on GH200. We demonstrate that TKLQT accurately identifies this CPU-bound/GPU-bound transition point. Based on this analysis, we further show that kernel fusion offers significant potential to mitigate GH200's low-batch latency bottleneck by reducing kernel launch overhead. This detailed kernel-level characterization provides critical insights for optimizing diverse CPU-GPU coupling strategies. This work is an initial effort; we plan to explore other major AI/DL workloads that demand different degrees of CPU-GPU coupling.
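The claim that kernel fusion mitigates the low-batch bottleneck can be illustrated with a back-of-the-envelope latency model: if each of n small operators pays a per-kernel launch cost, fusing them into one kernel pays that cost once. This toy model and its numbers are illustrative assumptions, not measurements from the paper.

```python
def prefill_latency_us(n_ops: int, launch_us: float, exec_us: float,
                       fused: bool) -> float:
    """Toy latency model for a CPU-bound, low-batch phase.

    Unfused: every operator incurs its own kernel launch.
    Fused:   one launch covers all operators' work.
    (Assumes launches serialize with execution, i.e., no overlap --
    a simplification that holds when kernels are very short.)
    """
    if fused:
        return launch_us + n_ops * exec_us
    return n_ops * (launch_us + exec_us)

# Hypothetical numbers: 100 tiny ops, 10us launch cost, 2us GPU work each.
print(prefill_latency_us(100, 10.0, 2.0, fused=False))  # 1200.0
print(prefill_latency_us(100, 10.0, 2.0, fused=True))   # 210.0
```

In this regime launch overhead dominates (1000 of the 1200us unfused), so removing 99 of the 100 launches yields a large speedup; at high batch sizes exec_us grows and the benefit shrinks, consistent with the transition behavior TKLQT is designed to expose.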