NonGEMM Bench: Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads

📅 2024-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Non-GEMM operators have emerged as a significant performance bottleneck on modern hardware, especially for quantized models, yet their impact has lacked systematic, quantitative characterization. Method: This work presents an end-to-end latency decomposition across 16 mainstream ML models (from Hugging Face and TorchVision) on workstation and data center platforms, with and without GPUs, covering both FP32 and quantized deployments. Contribution/Results: We find that non-GEMM operations account for 11.3%–73.6% of total inference latency, and that quantization exacerbates this overhead. Traditional operator fusion reduces, but does not eliminate, non-GEMM latency, leaving 15%–48% of total latency attributable to such operations. We identify the most expensive non-GEMM operator types per model and deployment software, and show that current deployment-stack co-optimizations do not fundamentally resolve the issue. This study establishes non-GEMM optimization as a critical path for next-generation AI acceleration, providing empirical evidence and prioritized guidance for targeted system-level improvements.

📝 Abstract
Among ML operators today, GEneral Matrix Multiplication (GEMM)-based operators are known to be key operators that build the main backbone of ML models. As their computational overhead dominates the overall execution time (e.g., 42.8% - 96.6% in our results), GEMM operators have been the prime optimization targets for fast ML inference. This led to the advanced GPUs and accelerators available today, which provide a significant boost in GEMM performance compared to CPUs, in line with the lesson from Amdahl's law. However, accelerating GEMM has significantly shifted the Amdahl's-law landscape for ML inference: due to the decreased GEMM execution time, GEMM is no longer overwhelmingly dominant, and the relative execution time of non-GEMM operators has become significant. Although the importance of non-GEMM performance is increasing, we have little knowledge about the non-GEMM performance horizon on the latest hardware platforms and models. Therefore, to guide non-GEMM-oriented optimizations, we conduct a thorough performance analysis of 16 widely adopted ML models in Hugging Face and Torchvision on workstation and data center platforms with/without GPUs. We discover that the non-GEMM performance bottleneck is a considerable issue across all the platforms and models, accounting for 11.3% to 73.6% of total latency, on average. The challenge is significantly aggravated when we apply quantization, a common model compression technique, due to the boosted GEMM performance and the extra non-GEMM operators for dequantization and requantization. To provide insights into non-GEMM optimization targets, we demystify the most dominant non-GEMM operators for each model and deployment software. We also show that widely adopted optimizations such as operator fusion do not completely address the non-GEMM performance bottleneck, where non-GEMM operators still account for 15% to 48% of total latency.
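The Amdahl's-law shift the abstract describes can be sketched with simple arithmetic: speeding up only the GEMM portion of inference inflates the *relative* share of the untouched non-GEMM portion. This is a minimal illustration with made-up numbers, not figures from the paper.

```python
# Hedged sketch (illustrative numbers only): how accelerating GEMM
# shifts the latency balance toward non-GEMM operators.
def non_gemm_share(total_ms, gemm_fraction, gemm_speedup):
    """Return non-GEMM's share of total latency after GEMM is sped up."""
    gemm_ms = total_ms * gemm_fraction / gemm_speedup   # accelerated GEMM time
    non_gemm_ms = total_ms * (1.0 - gemm_fraction)      # non-GEMM time unchanged
    return non_gemm_ms / (gemm_ms + non_gemm_ms)

# Hypothetical 100 ms inference, 90% of it spent in GEMM:
print(round(non_gemm_share(100.0, 0.9, 1.0), 3))   # 0.1   (no acceleration)
print(round(non_gemm_share(100.0, 0.9, 10.0), 3))  # 0.526 (10x GEMM speedup)
```

With a 10x GEMM speedup, non-GEMM work grows from 10% to over half of end-to-end latency even though its absolute cost never changed, which is the paper's motivation for measuring it directly.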
Problem

Research questions and friction points this paper is trying to address.

Analyzes non-GEMM performance bottlenecks in ML models.
Investigates impact of quantization on non-GEMM operator latency.
Identifies dominant non-GEMM operators for optimization targets.
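The third point, identifying dominant non-GEMM operators, amounts to partitioning a per-operator latency trace into GEMM and non-GEMM buckets and ranking the latter. Below is a minimal sketch of that bookkeeping; the trace values and the GEMM-op list are hypothetical stand-ins, not data or a taxonomy from the paper.

```python
# Hypothetical per-operator latencies (ms), e.g. from a profiler trace.
trace = {
    "aten::linear": 40.2, "aten::conv2d": 22.5,   # GEMM-backed operators
    "aten::softmax": 6.1, "aten::layer_norm": 8.4,
    "aten::gelu": 3.0, "aten::embedding": 2.2,    # non-GEMM operators
}
# Assumed classification of which ops count as GEMM-backed.
GEMM_OPS = {"aten::linear", "aten::conv2d", "aten::matmul", "aten::bmm"}

non_gemm = {op: ms for op, ms in trace.items() if op not in GEMM_OPS}
share = sum(non_gemm.values()) / sum(trace.values())   # non-GEMM latency share
dominant = max(non_gemm, key=non_gemm.get)             # costliest non-GEMM op

print(f"non-GEMM share: {share:.1%}, dominant non-GEMM op: {dominant}")
```

In this toy trace, normalization dominates the non-GEMM bucket; the paper performs this kind of decomposition systematically across models, platforms, and deployment software.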
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes non-GEMM performance in ML models
Identifies non-GEMM bottlenecks across platforms
Highlights limitations of operator fusion optimizations