Omniwise: Predicting GPU Kernels Performance with LLMs

📅 2025-06-25
🤖 AI Summary
This work addresses the challenge of GPU kernel performance prediction. We propose Omniwise, the first end-to-end, self-supervised fine-tuning framework that applies large language models (LLMs) to predict multidimensional performance metrics, including memory bandwidth, cache hit rate, GFLOPs, and arithmetic intensity, directly from kernel source code, without requiring code execution or conventional profiling tools. Our key contributions are: (1) the first application of LLMs to GPU kernel performance modeling, enabling joint multi-metric prediction; (2) a lightweight, model-agnostic pipeline that achieves strong results even with a 3B-parameter model; and (3) self-supervised fine-tuning integrated with a VS Code plugin to close the development loop. Evaluation on AMD MI250 and MI300X platforms demonstrates that over 90% of predictions achieve relative error below 10%, significantly accelerating performance tuning in HPC and AI systems.

📝 Abstract
In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful architectures have transformed a wide range of downstream applications, tackling tasks beyond human reach. In this paper, we introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline that applies large language models (LLMs) to GPU kernel performance prediction, a novel use case in performance profiling. Omniwise is model-agnostic and lightweight, achieving strong results even with a small 3B-parameter model. It can predict key performance metrics, including memory bandwidth, cache hit rates, GFLOPs, and arithmetic intensity, directly from kernel code without the need for code execution or profiling tools. Our approach achieves over 90% of predictions within 10% relative error on GPU kernels executed on AMD MI250 and MI300X architectures. In addition to the pipeline, we develop an online inference server and a Visual Studio Code plugin that seamlessly integrate LLM-based performance prediction into developers' workflows.
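One of the metrics the model predicts, arithmetic intensity, is the standard roofline-model quantity: floating-point operations per byte of memory traffic. A minimal sketch of how it is computed (the function and example kernel shape are illustrative, not from the paper):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Roofline arithmetic intensity: floating-point ops per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved

# Example: an AXPY-like kernel performing 2*N FLOPs while streaming
# three arrays of N float32 values (two reads, one write).
N = 1 << 20
ai = arithmetic_intensity(2 * N, 3 * N * 4)  # 1/6 FLOP per byte
```

A low value like this marks the kernel as memory-bound, which is exactly the kind of diagnosis these predicted metrics enable without running the profiler.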
Problem

Research questions and friction points this paper is trying to address.

Predict GPU kernel performance using LLMs
Estimate metrics without code execution
Integrate prediction into developer workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based GPU kernel performance prediction
Self-supervised fine-tuning pipeline
Model-agnostic lightweight 3B-parameter architecture
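The self-supervised fine-tuning idea above amounts to pairing kernel source code with profiler-collected metrics as prompt/target training records. A hedged sketch of what such a record builder might look like (the prompt wording and metric schema are hypothetical, not the paper's actual format):

```python
import json

def make_training_example(kernel_src: str, metrics: dict) -> dict:
    """Pair a GPU kernel's source with its profiled metrics as a
    prompt/target record for LLM fine-tuning. The format here is
    illustrative only."""
    prompt = (
        "Predict the performance metrics of the following GPU kernel:\n"
        f"{kernel_src}\n"
    )
    # Serializing the metrics deterministically gives the model a
    # stable multi-metric target to learn jointly.
    target = json.dumps(metrics, sort_keys=True)
    return {"prompt": prompt, "target": target}

example = make_training_example(
    "__global__ void axpy(float a, float* x, float* y) { /* ... */ }",
    {"memory_bandwidth_gbps": 1200.0, "l2_hit_rate": 0.82,
     "gflops": 310.0, "arithmetic_intensity": 0.25},
)
```

Because the labels come from automated profiling runs rather than human annotation, datasets like this can be generated at scale, which is what makes the fine-tuning self-supervised.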
Zixian Wang
University of California, San Diego
Cole Ramos
AMD, Austin, TX, USA
Muhammad A. Awad
AMD, Santa Clara, CA, USA
Keith Lowery
AMD, Austin, TX, USA