Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
GPU-accelerated triangular matrix–matrix multiplication (TRMM) and triangular solve (TRSM) kernels suffer from poor portability and inconsistent performance across heterogeneous GPU architectures (NVIDIA, AMD, Apple Silicon). Method: This paper proposes a unified algorithmic framework based on recursive divide-and-conquer and GEMM redirection. It reformulates TRMM/TRSM as sequences of standard GEMM calls, optimizes memory access patterns and compute–memory overlap, and leverages Julia’s multiple dispatch and metaprogramming to achieve hardware-agnostic abstraction. Contribution/Results: The implementation—requiring only ~500 lines of code—enables single-API deployment across all target platforms. It delivers the first high-performance TRMM/TRSM implementation on Apple Silicon GPUs; for large matrices, throughput approaches that of cuBLAS and rocBLAS, while cross-architecture performance variance is substantially reduced. This work breaks the long-standing trade-off between high performance and high portability in GPU linear algebra kernels.

📝 Abstract
This paper presents a performant and portable recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) in Julia for GPUs, two kernels that underlie many linear-algebra algorithms. We restructure TRMM and TRSM so that most work is executed as general matrix-matrix multiplication (GEMM), improving use of the GPU memory hierarchy and reducing latency. Exploiting Julia's multiple dispatch and metaprogramming together with the GPUArrays and KernelAbstractions frameworks, we expose a single hardware-agnostic API that runs on NVIDIA, AMD, and Apple Silicon GPUs. For large matrices the recursive code reaches throughput comparable to vendor libraries such as cuBLAS and rocBLAS, while providing these routines on Apple Silicon for the first time. The entire implementation is only a few hundred lines of code, showing that unified Julia programs can deliver near-vendor performance across heterogeneous architectures.
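The abstract's core idea, recasting a triangular solve as recursive splits whose bulk work is ordinary GEMM, can be made concrete with a small dependency-free sketch. This is not the paper's Julia/GPU implementation; it is a pure-Python CPU analogue (function names such as `trsm_lower` are illustrative, not from the paper), shown only to expose the GEMM-redirection structure.

```python
def matmul(A, B):
    # Plain triple-loop GEMM: C = A @ B. On a GPU this is the call that
    # would be redirected to a tuned vendor GEMM (cuBLAS, rocBLAS, ...).
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matsub(A, B):
    # Elementwise C = A - B.
    return [[A[i][j] - B[i][j] for j in range(len(A[0]))]
            for i in range(len(A))]

def trsm_lower(L, B, base=1):
    """Solve L @ X = B for X, with L lower triangular, by recursive
    divide-and-conquer. Splitting L = [[L11, 0], [L21, L22]] and
    B = [B1; B2] gives:
        X1 = trsm(L11, B1)
        B2 := B2 - L21 @ X1   <- the GEMM update, where most flops land
        X2 = trsm(L22, B2)
    """
    n = len(L)
    if n <= base:
        # 1x1 base case: a scalar division.
        return [[b / L[0][0] for b in B[0]]]
    h = n // 2
    L11 = [row[:h] for row in L[:h]]
    L21 = [row[:h] for row in L[h:]]
    L22 = [row[h:] for row in L[h:]]
    B1, B2 = B[:h], B[h:]
    X1 = trsm_lower(L11, B1, base)
    B2u = matsub(B2, matmul(L21, X1))  # GEMM-dominated step
    X2 = trsm_lower(L22, B2u, base)
    return X1 + X2

# Small demo: a 4x4 lower-triangular system with known solution X = ones.
L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [4.0, 1.0, 5.0, 0.0],
     [2.0, 2.0, 1.0, 4.0]]
B = [[2.0], [4.0], [10.0], [9.0]]
X = trsm_lower(L, B)
# X == [[1.0], [1.0], [1.0], [1.0]]
```

Because the off-diagonal update `B2 - L21 @ X1` dominates the flop count as the matrix grows, the recursion inherits the throughput of whatever GEMM it calls, which is presumably why the paper's large-matrix performance approaches cuBLAS/rocBLAS.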
Problem

Research questions and friction points this paper is trying to address.

Develop a portable GPU implementation of TRMM and TRSM
Route most of the arithmetic through GEMM for GPU efficiency
Provide a unified, hardware-agnostic API across multiple GPU platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive restructuring that executes most work as GEMM
Julia's multiple dispatch enables a hardware-agnostic API
A single unified codebase achieves near-vendor performance
Vicki Carrica
Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.
Maxwell Onyango
Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.
Rabab Alomairy
Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.
Evelyne Ringoot
PhD candidate, Massachusetts Institute of Technology
High-Performance Computing · Linear Algebra
James Schloss
Leios Labs LLC
Physics · Computer Science · Quantum · GPU computing · Algorithms
Alan Edelman
Professor of Applied Mathematics, Member of the Computer Science & AI Lab, MIT
Corgis · Random Matrix Theory · Julia · Numerical Linear Algebra · Parallel Computing