🤖 AI Summary
This work addresses performance bottlenecks and dynamic behavior in LLM inference on CPU-GPU coupled architectures, comparing loosely coupled PCIe-based A100/H100 systems against the tightly coupled GH200 (Grace Hopper). The authors propose SKIP, a fine-grained operator-to-kernel profiling framework, and introduce Total Kernel Launch and Queuing Time (TKLQT), a novel metric that quantifies end-to-end kernel submission and queuing overhead. TKLQT reveals that GH200 remains CPU-bound up to 4x larger batch sizes than the PCIe-based systems, with Grace CPU performance driving higher inference latency at low batch sizes. Building on this insight, the authors apply kernel fusion tailored to the low-batch regime. Experimental results show: (i) GH200 achieves 1.9×–2.7× faster prefill latency than PCIe-based systems at large batch sizes (Llama 3.2-1B); (ii) TKLQT accurately identifies the CPU-bound/GPU-bound transition point; and (iii) kernel fusion significantly alleviates GH200's low-batch latency bottleneck by reducing kernel launch overhead.
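To make the TKLQT idea concrete, here is a minimal sketch of how launch and queuing overhead could be aggregated from profiler trace events. This is an illustration of the metric's definition as described above, not SKIP's actual API; the event fields and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class KernelEvent:
    """Hypothetical trace record for one kernel (all times in microseconds)."""
    launch_start: float  # CPU enters the launch API (e.g., cudaLaunchKernel)
    launch_end: float    # CPU returns from the launch API
    exec_start: float    # GPU actually begins executing the kernel

def tklqt(events: list[KernelEvent]) -> float:
    """Total Kernel Launch and Queuing Time: sum, over all kernels, of
    CPU-side launch overhead plus time spent waiting in the stream queue."""
    total = 0.0
    for e in events:
        launch_overhead = e.launch_end - e.launch_start
        queue_wait = max(0.0, e.exec_start - e.launch_end)
        total += launch_overhead + queue_wait
    return total

# Two kernels: the first spends 2us launching and 3us queued,
# the second 1us launching and 2us queued -> TKLQT = 8us.
events = [KernelEvent(0.0, 2.0, 5.0), KernelEvent(2.0, 3.0, 5.0)]
print(tklqt(events))  # 8.0
```

When kernels are short and launches are frequent (the low-batch regime), this sum grows relative to GPU execution time, which is how a CPU-bound phase shows up in the metric.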
📝 Abstract
Large language model (LLM) inference workloads increasingly dominate data center costs and resource utilization. Therefore, understanding inference workload characteristics on evolving CPU-GPU coupled architectures is crucial for optimization. This paper presents an in-depth analysis of LLM inference behavior on loosely-coupled (LC; PCIe-based A100/H100) and closely-coupled (CC; GH200) systems. We analyze performance dynamics using fine-grained operator-to-kernel trace analysis, facilitated by our novel profiler SKIP and metrics like Total Kernel Launch and Queuing Time (TKLQT). Results show that the CC GH200 significantly outperforms LC systems at large batch sizes, achieving 1.9x-2.7x faster prefill latency for Llama 3.2-1B. However, our analysis also reveals that GH200 remains CPU-bound up to 4x larger batch sizes than LC systems. In this extended CPU-bound region, we identify the performance characteristics of the Grace CPU as a key factor contributing to higher inference latency at low batch sizes on GH200. We demonstrate that TKLQT accurately identifies this CPU-bound/GPU-bound transition point. Based on this analysis, we further show that kernel fusion offers significant potential to mitigate GH200's low-batch latency bottleneck by reducing kernel launch overhead. This detailed kernel-level characterization provides critical insights for optimizing diverse CPU-GPU coupling strategies. This work is an initial effort; we plan to explore other major AI/DL workloads that demand different degrees of CPU-GPU coupling.
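The claim that kernel fusion mitigates the low-batch bottleneck can be illustrated with a back-of-the-envelope latency model: if each of n small operators pays a per-kernel launch cost, fusing them into one kernel pays that cost once. This toy model and its numbers are illustrative assumptions, not measurements from the paper.

```python
def prefill_latency_us(n_ops: int, launch_us: float, exec_us: float,
                       fused: bool) -> float:
    """Toy latency model for a CPU-bound, low-batch phase.

    Unfused: every operator incurs its own kernel launch.
    Fused:   one launch covers all operators' work.
    (Assumes launches serialize with execution, i.e., no overlap --
    a simplification that holds when kernels are very short.)
    """
    if fused:
        return launch_us + n_ops * exec_us
    return n_ops * (launch_us + exec_us)

# Hypothetical numbers: 100 tiny ops, 10us launch cost, 2us GPU work each.
print(prefill_latency_us(100, 10.0, 2.0, fused=False))  # 1200.0
print(prefill_latency_us(100, 10.0, 2.0, fused=True))   # 210.0
```

In this regime launch overhead dominates (1000 of the 1200us unfused), so removing 99 of the 100 launches yields a large speedup; at high batch sizes exec_us grows and the benefit shrinks, consistent with the transition behavior TKLQT is designed to expose.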