Stream-K++: Adaptive GPU GEMM Kernel Scheduling and Selection using Bloom Filters

📅 2024-08-21
🏛️ arXiv.org

🤖 AI Summary
This paper proposes Stream-K++, an adaptive kernel scheduling and selection framework for General Matrix Multiplication (GEMM) on GPUs. It expands the scheduling policies of the Stream-K algorithm from three to seven and adds a Bloom-filter-based selection mechanism that rapidly eliminates up to 95.8% of unsuitable configurations while maintaining a 100% true-negative rate. The framework is built with AMD's Composable Kernel library, implemented in the Opensieve C++ library, and evaluated on AMD Instinct MI250X GPUs. Stream-K++ delivers performance gains of up to 43% in select scenarios and stays within 20% of the optimal configuration for 60–97.6% of problem sizes.

📝 Abstract
General matrix multiplication (GEMM) operations are crucial in various computational fields. As GPU architectures evolve, optimizing GEMM performance becomes increasingly important. This paper introduces Stream-K++, an enhancement to the promising Stream-K GEMM scheduling algorithm. We expand Stream-K's scheduling policies from three to seven and implement an efficient solution selection mechanism using Bloom filters. Our approach rapidly eliminates up to 95.8% of unsuitable configurations while maintaining a 100% true-negative rate. Implemented using the AMD Composable Kernel library and evaluated on AMD Instinct MI250X GPUs, Stream-K++ demonstrates significant performance gains (up to 43%) in select scenarios. It remains competitive (within 20% of optimal) for 60-97.6% of problem sizes. Our flexible framework, implemented in the Opensieve C++ library, allows for easy adaptation to new problem sizes, scheduling policies, or additional tuning parameters, paving the way for future optimizations in GPU-based GEMM operations.
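
The abstract does not spell out how the Bloom filters are keyed or queried, so the following is only a minimal sketch under stated assumptions: one filter per scheduling policy, populated with the problem sizes (M, N, K) for which that policy was judged suitable during tuning. The class `BloomFilter`, the helpers `build_filters` and `candidate_policies`, and the policy names are all hypothetical and are not the Opensieve API. The sketch relies on the standard Bloom filter property that a lookup may return a false positive but never a false negative, so a policy rejected for a given size can be discarded safely, while surviving policies still need to be checked.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter (illustrative, not the Opensieve implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely never added"; True means "possibly added".
        return all(self.bits[pos] for pos in self._positions(key))


def build_filters(tuning_results):
    """Build one filter per policy from {policy: [(M, N, K), ...]} (assumed layout)."""
    filters = {}
    for policy, sizes in tuning_results.items():
        bloom = BloomFilter()
        for size in sizes:
            bloom.add(size)
        filters[policy] = bloom
    return filters


def candidate_policies(filters, size):
    """Keep only the policies whose filter does not rule out this problem size."""
    return [policy for policy, bloom in filters.items() if bloom.might_contain(size)]


# Hypothetical tuning data: policy names and sizes are made up for illustration.
tuning_results = {
    "policy_a": [(1024, 1024, 1024), (4096, 4096, 512)],
    "policy_b": [(1024, 1024, 1024)],
}
filters = build_filters(tuning_results)
survivors = candidate_policies(filters, (4096, 4096, 512))
```

Here `survivors` is guaranteed to include `policy_a`, because that size was added to its filter; whether `policy_b` also survives depends on hash collisions, which is why survivors are candidates rather than final answers.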
Problem

Research questions and friction points this paper is trying to address.

Optimizing GEMM performance as GPU architectures evolve
Improving workload balancing through a wider set of Stream-K scheduling policies
Rapidly selecting efficient configurations by eliminating unsuitable ones with Bloom filters
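
To make the workload-balancing point concrete, here is a rough sketch of the basic idea behind the original Stream-K scheduling, not of the seven Stream-K++ policies, which this page does not describe. It assumes a conventional data-parallel schedule that assigns whole output tiles to compute units in waves, and a Stream-K-style schedule that splits the total number of inner-loop (K-dimension) iterations evenly across workgroups. The function names, tile sizes, and compute-unit count are hypothetical.

```python
import math


def data_parallel_utilization(m, n, tile_m, tile_n, num_cus):
    """Fraction of compute-unit slots doing useful work when whole tiles run in waves."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_cus)
    return tiles / (waves * num_cus)


def stream_k_ranges(m, n, k, tile_m, tile_n, tile_k, num_workgroups):
    """Split all tile iterations into near-equal contiguous ranges, one per workgroup."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    iters_per_tile = math.ceil(k / tile_k)
    total_iters = tiles * iters_per_tile
    base, extra = divmod(total_iters, num_workgroups)
    ranges = []
    start = 0
    for wg in range(num_workgroups):
        # The first `extra` workgroups take one additional iteration each.
        count = base + (1 if wg < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


# Hypothetical example: 1024 x 1024 output, 128 x 128 tiles, 48 compute units.
utilization = data_parallel_utilization(1024, 1024, 128, 128, num_cus=48)
ranges = stream_k_ranges(1024, 1024, 4096, 128, 128, 32, num_workgroups=48)
```

With 64 tiles on 48 compute units, the data-parallel schedule needs two waves and uses 64 of 96 slots, about 67%. The Stream-K-style split gives every workgroup either 170 or 171 of the 8192 iterations, at the cost of some workgroups sharing a tile and needing to combine partial results.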
Innovation

Methods, ideas, or system contributions that make the work stand out.

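Expands Stream-K's scheduling policies from three to seven
Uses Bloom filters to eliminate up to 95.8% of unsuitable configurations while maintaining a 100% true-negative rate
Achieves performance gains of up to 43% in select scenarios on AMD Instinct MI250X GPUs, staying within 20% of optimal for 60–97.6% of problem sizes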