Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

career value

215K/year

🤖 AI Summary

This work addresses the challenge of automatically generating high-performance GPU tiled kernels from high-level tensor algebra expressions, thereby alleviating the burden of manual optimization. The authors propose an end-to-end compilation framework that integrates layer-wise lowering, expression rewriting, automated schedule search, reduction fusion, and tiling optimizations. For the first time, this framework automatically discovers high-efficiency kernels—comparable to FlashAttention-3—from the mathematical specification of attention operators, while introducing a novel scheduler that preserves program structural regularity. Evaluated on GH200 and RTX 5090 GPUs, the generated kernels achieve up to 23% and 42% higher throughput, respectively, and match or surpass hand-optimized cuDNN kernels across multiple long-sequence configurations.

Technology Category

Application Category

📝 Abstract

We present Nautilus, a novel tensor compiler that moves toward fully automated math-to-kernel optimization. Nautilus compiles a high-level algebraic specification of tensor operators into efficient tiled GPU kernels. Nautilus's successive lowering design allows high-level optimizations, expression rewrites, and tile optimizations to be jointly applied in a single end-to-end system. Nautilus presents a novel auto-scheduler that discovers sequences of high-level optimizations, while preserving the regular program structure needed by tile optimizers. Nautilus's auto-scheduler captures complex interactions and trade-offs in the high-level optimizations, including aggressive global transformations like advanced reduction fusion. Nautilus is the first end-to-end tensor compiler capable of starting from a math-like description of attention and automatically discovering FlashAttention-3-like kernels, offloading the entire burden of optimization from the programmer to the compiler. Across five transformer-based models and 150 evaluation configurations on NVIDIA GH200 and RTX 5090 GPUs, Nautilus achieves up to 23% higher throughput than state-of-the-art compilers on GH200 and up to 42% on RTX 5090, while matching or exceeding manually written cuDNN kernels on many long-sequence configurations.

Problem

Research questions and friction points this paper is trying to address.

tensor compilation

GPU kernel optimization

auto-scheduling

tiled kernels

high-level optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

auto-scheduling

tensor compiler

tiled GPU kernels