🤖 AI Summary
Problem: Large language model (LLM) inference suffers from poor performance portability across heterogeneous GPUs (e.g., NVIDIA, AMD, Intel), heavy reliance on vendor-provided closed-source optimizations, and labor-intensive manual kernel tuning.
Method: We propose a portable, high-performance execution framework that requires no user code modification, achieved by tightly integrating just-in-time (JIT) compilation with fine-grained kernel parameter autotuning—enabling, for the first time, joint compile-time optimization between JIT and autotuning. This synergy expands the configuration search space by up to 15× and substantially increases the diversity of generated kernels.
Results: Evaluated on Flash Attention, our approach outperforms vendor-optimized libraries across all three GPU architectures by up to 230%. Generated kernels are 70× smaller in code size, and manual tuning is eliminated entirely. Our method establishes a new paradigm for efficient, cross-platform LLM deployment.
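The core loop described in the Method section—JIT-specializing a kernel for each candidate parameter configuration, then benchmarking to select the fastest—can be sketched in plain Python. This is a minimal illustrative mock, not the paper's framework: `jit_specialize` stands in for a real GPU JIT compiler, and the parameter names (`block_m`, `block_n`, `unroll`) are hypothetical tuning knobs chosen for illustration.

```python
import time
from itertools import product

def jit_specialize(block_m, block_n, unroll):
    """Hypothetical stand-in for a JIT compiler: bake the tile
    parameters in as compile-time constants and return a kernel
    specialized to them. A real system would emit GPU code here."""
    def kernel(data):
        acc = 0.0
        tile_elems = block_m * block_n
        # The tile size is fixed at "compile" time, so the loop
        # structure differs per configuration.
        for i in range(0, len(data), tile_elems):
            acc += sum(data[i:i + tile_elems])  # placeholder work
        return acc
    return kernel

def autotune(data, search_space):
    """JIT-compile every configuration in the joint search space,
    time each one, and keep the fastest -- the JIT/autotuning
    co-design lets this space cover specializations that fixed,
    pre-compiled presets cannot express."""
    best_cfg, best_time, best_kernel = None, float("inf"), None
    for block_m, block_n, unroll in search_space:
        kernel = jit_specialize(block_m, block_n, unroll)
        kernel(data)  # warm-up (triggers compilation in a real JIT)
        t0 = time.perf_counter()
        kernel(data)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_cfg, best_time, best_kernel = (
                (block_m, block_n, unroll), elapsed, kernel)
    return best_cfg, best_kernel

# A 3x3x3 grid of hypothetical tuning knobs -> 27 configurations.
space = list(product([16, 32, 64], [16, 32, 64], [1, 2, 4]))
cfg, kernel = autotune(list(range(1024)), space)
```

In a real deployment, the warm-up run would trigger actual code generation, and the timing would measure GPU execution; the structure of the search loop is otherwise the same.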
📝 Abstract
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable LLM execution with state-of-the-art performance and no code changes. Focusing on flash attention -- a widely used, performance-critical LLM kernel -- we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.