GPU Performance Portability needs Autotuning

📅 2025-04-30
🤖 AI Summary
Large language model (LLM) inference suffers from poor performance portability across heterogeneous GPUs (e.g., NVIDIA, AMD, Intel), heavy reliance on vendor-provided closed-source optimizations, and labor-intensive manual kernel tuning. Method: We propose a portable, high-performance execution framework that requires no user code modification, achieved by tightly integrating just-in-time (JIT) compilation with fine-grained kernel parameter autotuning—enabling joint compile-time optimization between JIT and autotuning for the first time. This synergy expands the configuration search space by 15× and significantly enhances generated kernel diversity. Results: Evaluated on Flash Attention, our approach outperforms vendor-optimized libraries across all three GPU architectures, achieving up to 230% speedup. Generated kernels are 70× smaller in binary size, and manual tuning is fully eliminated. Our method establishes a new paradigm for efficient, cross-platform LLM deployment.

📝 Abstract
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art performance LLM execution without code changes. Focusing on flash attention -- a widespread performance-critical LLM kernel -- we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
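The core idea above — JIT-compiling one specialized kernel variant per candidate parameter configuration, then timing each variant to pick the winner — can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: real systems (e.g., Triton's `@autotune`) bake tile sizes into the compiled GPU kernel as constants, whereas here "specialization" is simulated by closing over fixed tile sizes, and the kernel is a trivial tiled elementwise add. All names (`make_kernel`, `autotune`) are hypothetical.

```python
import itertools
import time

def make_kernel(block_m, block_n):
    # "JIT" step: produce one variant specialized for fixed tile sizes.
    # In a real JIT compiler these would be compile-time constants that
    # change the generated code, not just closure variables.
    def kernel(a, b, out, n):
        for i0 in range(0, n, block_m):          # tile over rows
            for j0 in range(0, n, block_n):      # tile over columns
                for i in range(i0, min(i0 + block_m, n)):
                    for j in range(j0, min(j0 + block_n, n)):
                        out[i][j] = a[i][j] + b[i][j]
    return kernel

def autotune(configs, n=64):
    # Benchmark every specialized variant on a fixed workload and keep
    # the fastest one -- the essence of kernel-parameter autotuning.
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    best = None
    for bm, bn in configs:
        kernel = make_kernel(bm, bn)             # compile one variant
        out = [[0.0] * n for _ in range(n)]
        t0 = time.perf_counter()
        kernel(a, b, out, n)
        dt = time.perf_counter() - t0
        if best is None or dt < best[0]:
            best = (dt, (bm, bn), out)
    return best

# Cartesian product of tile-size choices: the search space grows
# multiplicatively with each tunable parameter, which is why coupling
# JIT with autotuning expands the space so quickly.
configs = list(itertools.product([8, 16, 32], repeat=2))
dt, (bm, bn), out = autotune(configs)
print("best tiles:", bm, bn)
```

A production autotuner would additionally warm up each variant, average several timed runs, and cache the winning configuration per input shape, since a single cold measurement is noisy.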
Problem

Research questions and friction points this paper is trying to address.

Achieving portable high-performance LLM execution across GPU vendors
Reducing reliance on single-platform vendor lock-in for AI hardware
Enhancing flash attention kernel efficiency via autotuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines JIT compilation with autotuning
Explores 15x more kernel configurations
Reduces code size by 70x
Burkhard Ringlein
IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
Thomas Parnell
Principal Research Scientist, IBM Research
Machine Learning and Systems
R. Stoica
IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland