WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

📅 2026-04-11

🤖 AI Summary

This work addresses the challenge that GPU kernel performance is highly sensitive to runtime parameters, while existing auto-tuning approaches struggle to balance configuration optimality against decision overhead. The authors propose a wave-aware auto-tuning framework that unifies diverse inputs through a parameter-space mapping, introduces a wave-aware bilinear latency prediction model, and combines wave-guided sparse sampling with a lightweight dual-table lookup mechanism to reduce tuning cost. Evaluated on three representative kernel types and five GPU architectures, the method achieves near-optimal performance, delivering up to 1.83× kernel-level speedup and up to 1.33× reduction in end-to-end time-to-first-token (TTFT), while cutting runtime decision overhead by five orders of magnitude compared to exhaustive search.

📝 Abstract

The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix Multiplications (GEMMs). Although these kernels are highly optimized, their performance remains sensitive to a large space of runtime parameters, such as tile sizes and pipeline stages. The interaction between these parameters and hardware resources leads to a non-convex optimization landscape. Existing approaches to parameter configuration, including search-based auto-tuning, heuristic rules, and learned cost models, face a fundamental trade-off between performance optimality and runtime efficiency. In this paper, we present WaveTune, a wave-aware framework for runtime kernel auto-tuning. First, we introduce a unified mapping method to handle input diversity and decompose the configuration space to manage high dimensionality. Second, we develop an analytical wave-aware bilinear model that accurately predicts kernel latency. Third, we design a sparse sampling scheme based on wave structures and a lightweight dual-table retrieval mechanism to minimize runtime overhead. As a result, WaveTune enables precise and efficient runtime configuration for GPU kernels. Across three representative kernels and five GPU architectures, WaveTune consistently achieves near-optimal kernel performance, delivering up to 1.83x kernel-level speedup and up to 1.33x reduction in end-to-end time-to-first-token (TTFT), while reducing runtime decision overhead by five orders of magnitude compared to exhaustive search. These results demonstrate that WaveTune effectively eliminates the traditional trade-off between configuration latency and execution optimality, providing a practical and robust solution for high-performance LLM inference.
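The abstract does not spell out the form of the wave-aware bilinear model, so the following is only an illustrative sketch of the general idea behind "waves" in tile-based kernels, not WaveTune's actual model. It assumes a GEMM whose output is split into tiles, one thread block per tile, with a fixed number of blocks resident at once; the function names, the `blocks_per_sm` parameter, the coefficients `a`, `b`, `c`, `d`, and the choice of `(waves, k)` as the two model inputs are all hypothetical.

```python
import math


def num_tiles(m: int, n: int, tile_m: int, tile_n: int) -> int:
    """Number of output tiles (one thread block each) for an M x N output."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)


def num_waves(m: int, n: int, tile_m: int, tile_n: int,
              num_sms: int, blocks_per_sm: int = 1) -> int:
    """Waves needed when at most num_sms * blocks_per_sm blocks run at once.

    A partially filled last wave still costs a full wave, which is why
    latency tends to move in steps as the problem size grows.
    """
    concurrent_blocks = num_sms * blocks_per_sm
    return math.ceil(num_tiles(m, n, tile_m, tile_n) / concurrent_blocks)


def bilinear_latency(waves: int, k: int,
                     a: float, b: float, c: float, d: float) -> float:
    """Hypothetical bilinear form (an assumption, not the paper's model).

    Linear in `waves` for fixed `k`, and linear in `k` for fixed `waves`.
    """
    return a + b * waves + c * k + d * waves * k


if __name__ == "__main__":
    # Example values only: a 4096 x 4096 output on a GPU with 108 SMs.
    for tile_m, tile_n in [(128, 128), (256, 128)]:
        waves = num_waves(4096, 4096, tile_m, tile_n, num_sms=108)
        print(f"tile {tile_m}x{tile_n}: {waves} waves")
```

Under these assumptions, changing the tile shape changes the wave count in discrete steps, which is one reason a latency model keyed on wave structure can be fitted from relatively few samples per configuration.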

Problem

Research questions and friction points this paper is trying to address.

GPU kernel auto-tuning

Large Language Models

GEMM

runtime parameter optimization

non-convex optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

wave-aware modeling

bilinear latency prediction

GPU kernel auto-tuning