HipKittens: Fast and Furious AMD Kernels

๐Ÿ“… 2025-11-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the lack of an efficient AI kernel development paradigm for AMD GPUs (CDNA architecture), this paper introduces HKโ€”the first high-performance AI operator programming framework tailored for AMD. Methodologically, it systematically identifies CDNA-applicable programming primitives, designs block-based explicit memory management, fine-grained asynchronous execution, and worker-coordination mechanisms, and implements a C++-embedded domain-specific language (DSL) to establish a vendor-portable software abstraction layer. Contributions include: (1) the first AMD-specific high-performance programming model, breaking the NVIDIA-centric DSL monopoly; and (2) near-optimal or even assembly-level performance on core operatorsโ€”e.g., attention (d=64) and grouped-query attention (GQA) backward pass achieve 1.2โ€“2.4ร— speedup over state-of-the-art baselines, significantly outperforming compiler-generated code.
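The tile-based, explicitly managed memory style the summary describes can be illustrated with a plain C++ sketch. This is a hypothetical CPU-only analogue (the `Tile`, `load_tile`, and `mma` names are invented for illustration); in HK itself, tiles map to MFMA register fragments and staging happens through LDS rather than ordinary arrays.

```cpp
#include <array>
#include <cstddef>

// Hypothetical CPU analogue of a register tile: a fixed-size block that a
// worker owns and operates on as a unit.
template <std::size_t R, std::size_t C>
struct Tile {
    std::array<float, R * C> data{};
    float& at(std::size_t r, std::size_t c) { return data[r * C + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * C + c]; }
};

// Explicit staged load: copy one R x C block of a row-major matrix (leading
// dimension ld) into a tile, mirroring block-based memory management.
template <std::size_t R, std::size_t C>
Tile<R, C> load_tile(const float* src, std::size_t ld,
                     std::size_t row0, std::size_t col0) {
    Tile<R, C> t;
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            t.at(r, c) = src[(row0 + r) * ld + col0 + c];
    return t;
}

// Tile-level multiply-accumulate (acc += a * b): the unit of work a worker
// issues per inner-loop step of a tiled GEMM.
template <std::size_t R, std::size_t K, std::size_t C>
void mma(Tile<R, C>& acc, const Tile<R, K>& a, const Tile<K, C>& b) {
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            for (std::size_t k = 0; k < K; ++k)
                acc.at(r, c) += a.at(r, k) * b.at(k, c);
}
```

The point of the abstraction is that the kernel author reasons in whole-tile operations (load, MMA, store) while the framework decides how each maps onto the vendor's matrix instructions.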

๐Ÿ“ Abstract
AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2-2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.
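The "fine-grained asynchronous execution across workers" the abstract mentions is typically realized as software pipelining: while one tile is being consumed by compute, the next is already being fetched into a second staging buffer. A minimal CPU-side sketch of that ping-pong pattern (the `pipelined_sum` function is invented for illustration and stands in for a real overlap of global-memory loads with compute):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum a long vector in fixed-size chunks using two staging buffers.
// Buffer `cur` is consumed while `1 - cur` is refilled -- the same ping-pong
// structure GPU kernels use so loads and compute overlap instead of serialize.
float pipelined_sum(const std::vector<float>& src, std::size_t chunk) {
    std::vector<float> buf[2];
    auto load = [&](int slot, std::size_t off) {
        std::size_t n = std::min(chunk, src.size() - off);
        buf[slot].assign(src.begin() + off, src.begin() + off + n);
    };
    float total = 0.0f;
    int cur = 0;
    load(cur, 0);                       // prologue: fill the first buffer
    for (std::size_t off = 0; off < src.size(); off += chunk) {
        std::size_t next_off = off + chunk;
        if (next_off < src.size())
            load(1 - cur, next_off);    // "async" load of the next chunk
        total = std::accumulate(buf[cur].begin(), buf[cur].end(), total);
        cur = 1 - cur;                  // swap ping-pong buffers
    }
    return total;
}
```

On a GPU the "load" step is issued as a non-blocking copy and synchronized with barriers; the sequential CPU version only shows the buffer choreography.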
Problem

Research questions and friction points this paper is trying to address.

Developing high-performance AI kernels for AMD GPUs using tile-based abstractions
Generalizing NVIDIA-specific programming primitives to AMD hardware architectures
Creating a vendor-portable software layer for efficient AI workload execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tile-based abstractions generalize to AMD GPUs
HipKittens framework enables high-performance AMD AI kernels
Kernels compete with hand-optimized assembly on AMD platforms
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors
William Hu, Stanford University
Drew Wadsworth, Stanford University
Sean Siddens, Advanced Micro Devices, Inc.
Stanley Winata, Advanced Micro Devices, Inc.
Daniel Y. Fu, University of California, San Diego
Ryan Swann, Advanced Micro Devices, Inc.
Muhammad Osama, Advanced Micro Devices, Inc.
Christopher Ré, Stanford University
Simran Arora, Computer Science, Stanford University