🤖 AI Summary
Problem: Large language model (LLM) inference suffers from poor performance portability across heterogeneous GPUs (e.g., NVIDIA, AMD, Intel), heavy reliance on vendor-provided closed-source optimizations, and labor-intensive manual kernel tuning.
Method: We propose a portable, high-performance execution framework that requires no user code modification, achieved by tightly integrating just-in-time (JIT) compilation with fine-grained kernel parameter autotuning—enabling, for the first time, joint compile-time optimization between JIT and autotuning. This synergy expands the configuration search space by up to 15× and substantially increases the diversity of generated kernels.
Results: Evaluated on Flash Attention, our approach outperforms vendor-optimized libraries across all three GPU architectures by up to 230%. Generated kernels are 70× smaller in code size, and manual tuning is eliminated entirely. Our method establishes a new paradigm for efficient, cross-platform LLM deployment.
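The core loop described in the Method section—JIT-specializing a kernel for each candidate parameter configuration, then benchmarking to select the fastest—can be sketched in plain Python. This is a minimal illustrative mock, not the paper's framework: `jit_specialize` stands in for a real GPU JIT compiler, and the parameter names (`block_m`, `block_n`, `unroll`) are hypothetical tuning knobs chosen for illustration.

```python
import time
from itertools import product

def jit_specialize(block_m, block_n, unroll):
    """Hypothetical stand-in for a JIT compiler: bake the tile
    parameters in as compile-time constants and return a kernel
    specialized to them. A real system would emit GPU code here."""
    def kernel(data):
        acc = 0.0
        tile_elems = block_m * block_n
        # The tile size is fixed at "compile" time, so the loop
        # structure differs per configuration.
        for i in range(0, len(data), tile_elems):
            acc += sum(data[i:i + tile_elems])  # placeholder work
        return acc
    return kernel

def autotune(data, search_space):
    """JIT-compile every configuration in the joint search space,
    time each one, and keep the fastest -- the JIT/autotuning
    co-design lets this space cover specializations that fixed,
    pre-compiled presets cannot express."""
    best_cfg, best_time, best_kernel = None, float("inf"), None
    for block_m, block_n, unroll in search_space:
        kernel = jit_specialize(block_m, block_n, unroll)
        kernel(data)  # warm-up (triggers compilation in a real JIT)
        t0 = time.perf_counter()
        kernel(data)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_cfg, best_time, best_kernel = (
                (block_m, block_n, unroll), elapsed, kernel)
    return best_cfg, best_kernel

# A 3x3x3 grid of hypothetical tuning knobs -> 27 configurations.
space = list(product([16, 32, 64], [16, 32, 64], [1, 2, 4]))
cfg, kernel = autotune(list(range(1024)), space)
```

In a real deployment, the warm-up run would trigger actual code generation, and the timing would measure GPU execution; the structure of the search loop is otherwise the same.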
📝 Abstract
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable LLM execution with state-of-the-art performance and no code changes. Focusing on flash attention -- a widely used, performance-critical LLM kernel -- we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.