Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

📅 2026-04-25

🤖 AI Summary

This study evaluates the performance and portability of CUDA Tile (CuTile) for AI workloads on Hopper and Blackwell GPUs. The authors benchmark GEMM, fused multi-head attention, and large language model (LLM) inference in BF16/FP16 precision, comparing CuTile against cuBLAS, Triton, WMMA, and raw SIMT implementations on the H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. On the datacenter-class B200, CuTile reaches up to 1007 TFLOP/s in fused attention, 2.5× the throughput of FlashAttention-2. For GEMM, it attains 52–79% of cuBLAS throughput. On the RTX PRO 6000 (sm_120), however, the same attention kernel reaches only 53% of FlashAttention-2 throughput. Triton, by contrast, sustains 62–101% of cuBLAS performance across all tested platforms without architecture-specific tuning, making it substantially more portable.

📝 Abstract

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.
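The relative figures in the abstract can be unpacked with a few lines of arithmetic. The Python sketch below is illustrative only and is not taken from the paper. It assumes the conventional 2·M·N·K FLOP count for a GEMM. The helper names (`gemm_tflops`, `fraction_of_baseline`) are hypothetical, and the 8192-sized problem and 1 ms timing are made-up inputs, not measurements.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s of an (m x k) @ (k x n) GEMM.

    Assumes the conventional count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    return 2.0 * m * n * k / seconds / 1e12


def fraction_of_baseline(candidate_tflops: float, baseline_tflops: float) -> float:
    """Candidate throughput as a fraction of a baseline such as cuBLAS."""
    return candidate_tflops / baseline_tflops


# Figures quoted in the abstract.
cutile_attention_b200 = 1007.0  # TFLOP/s, CuTile fused attention on B200
speedup_over_fa2 = 2.5          # CuTile vs. FlashAttention-2 on B200

# Implied FlashAttention-2 throughput on B200: about 402.8 TFLOP/s.
implied_fa2_b200 = cutile_attention_b200 / speedup_over_fa2

# Kernel size comparison from the abstract: 123 lines (WMMA) vs. 22 (CuTile).
loc_ratio = 123 / 22

# Hypothetical timing, NOT a measurement: an 8192^3 GEMM finishing in 1 ms.
example_tflops = gemm_tflops(8192, 8192, 8192, 1e-3)

print(f"Implied FlashAttention-2 on B200: {implied_fa2_b200:.1f} TFLOP/s")
print(f"WMMA / CuTile lines of code: {loc_ratio:.1f}x")
print(f"Hypothetical GEMM throughput: {example_tflops:.1f} TFLOP/s")
```

Read this way, the 2.5x attention speedup implies a FlashAttention-2 baseline of roughly 403 TFLOP/s on B200, and the CuTile GEMM kernel is about 5.6 times shorter than the WMMA one.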

Problem

Research questions and friction points this paper is trying to address.

CUDA Tile

AI workloads

GPU architecture

performance portability

Tensor Cores

Innovation

Methods, ideas, or system contributions that make the work stand out.

CUDA Tile

Tensor Core

cross-architecture evaluation