🤖 AI Summary
This work addresses the limitations of existing data-driven GPU performance models, which struggle to generalize across architectures and accurately model complex production-level kernels—hindering efficient hardware selection for large language model (LLM) inference. To overcome this, we propose SynPerf, the first approach that integrates analytical modeling with machine learning. SynPerf employs an analytical model to quantify kernel demands on GPU heterogeneous instruction pipelines and leverages a machine learning model to capture cross-pipeline interactions and resource dependencies, enabling high-fidelity performance prediction. Evaluated across 11 GPU architectures, SynPerf achieves kernel-level and end-to-end inference prediction errors as low as 6.1% and 8.5%, respectively—improving over state-of-the-art methods by 6.7× and 4.4×. Furthermore, it successfully guides the optimization of MoE Triton kernels, yielding up to a 1.7× speedup.
📝 Abstract
The rapid expansion of Transformer-based large language models has dramatically increased the need for high-performance GPUs. As a result, there is growing demand for fast, accurate, and widely generalizable GPU performance models to support next-generation hardware selection and system-level exploration. However, current data-driven methods are limited, exhibiting poor generalization across hardware and inadequate modeling of complex production-level kernels common in modern inference stacks. To address these issues, we present SyncPerf, a unified GPU modeling framework. This approach first employs an analytical model to quantify a given kernel's demands on the GPU's heterogeneous instruction pipelines. These analytical features are then fed into a machine learning (ML) model to capture complex cross-pipeline interactions and resource dependencies, enabling high-fidelity performance prediction. Our evaluation across 11 GPU types from four generations of major architectures on two widely-used serving systems demonstrates that SyncPerf delivers high fidelity and strong generalizability. It achieves accurate predictions, with only 6.1% average error at the kernel level and 8.5% for end-to-end inference -- reducing the error of state-of-the-art methods by 6.7x and 4.4x, respectively. We also demonstrate SynPerf's value"beyond simulation"by utilizing its performance ceiling to diagnose implementation shortcomings and guide the optimization of a production fused MoE Triton kernel, achieving up to 1.7x speedup.