TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

📅 2025-02-20
🤖 AI Summary
Current large language models (LLMs) frequently produce Triton GPU kernels with functional errors and suboptimal performance, primarily due to insufficient understanding of Triton programming conventions and GPU hardware characteristics, compounded by the absence of a dedicated evaluation benchmark. To address this, we introduce TritonBench, the first comprehensive benchmark for Triton kernel generation, comprising 184 real-world operators curated from GitHub together with a second channel of operators aligned with PyTorch interfaces. TritonBench integrates empirical GPU performance metrics (kernel latency and throughput) into code generation evaluation, proposing a dual-axis assessment: functional correctness plus hardware-aware performance. It additionally provides PyTorch interface specifications, CUDA configuration profiling, an open-source kernel dataset, and an automated verification pipeline. Experiments reveal that state-of-the-art LLMs achieve an average performance compliance rate below 12%, exposing a critical bottleneck in LLM-driven high-performance operator synthesis. TritonBench establishes a reproducible, extensible evaluation infrastructure for this task.
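The dual-axis assessment described above can be sketched as a small scoring routine: a generated kernel only counts toward the compliance rate if it both runs correctly against the reference implementation and meets a latency target. This is a hypothetical illustration of the idea, not TritonBench's actual scoring code; the field names, the `speedup_threshold` knob, and the pass criterion are assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class KernelResult:
    compiles: bool           # generated Triton code executes without error
    matches_reference: bool  # outputs agree with the reference implementation
    latency_ms: float        # measured latency of the generated kernel
    ref_latency_ms: float    # latency of the reference (e.g. PyTorch) kernel

def performance_compliant(r: KernelResult, speedup_threshold: float = 1.0) -> bool:
    """Dual-axis check: functional correctness AND hardware performance.

    `speedup_threshold` is a hypothetical knob: 1.0 requires the generated
    kernel to be at least as fast as the reference.
    """
    if not (r.compiles and r.matches_reference):
        return False  # fails the correctness axis outright
    return r.ref_latency_ms / r.latency_ms >= speedup_threshold

def compliance_rate(results: list[KernelResult]) -> float:
    """Fraction of benchmark operators passing both axes."""
    if not results:
        return 0.0
    return sum(performance_compliant(r) for r in results) / len(results)

# Toy run: one fast-and-correct kernel, one correct-but-slow, one broken.
results = [
    KernelResult(True, True, latency_ms=0.8, ref_latency_ms=1.0),
    KernelResult(True, True, latency_ms=2.0, ref_latency_ms=1.0),
    KernelResult(False, False, latency_ms=1.0, ref_latency_ms=1.0),
]
print(compliance_rate(results))  # only the first kernel passes both axes
```

Under this kind of metric, a model can score well on plain correctness yet poorly on compliance, which is consistent with the sub-12% compliance rates the paper reports.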

📝 Abstract
Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation. TritonBench will be available at https://github.com/thunlp/TritonBench.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs for Triton code generation
Evaluating efficiency of GPU-optimized Triton operators
Addressing gaps in high-performance GPU code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive benchmark for LLM-generated Triton operators
Profiles runtime efficiency on widely deployed GPUs, not just correctness
Curates 184 real-world GitHub operators plus PyTorch-aligned operators
Jianling Li
Tianjin University, Tianjin, China
Shangzhan Li
Harbin Institute of Technology, Harbin, China
Zhenye Gao
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Qi Shi
Tsinghua University, Beijing, China
Yuxuan Li
Tsinghua University, Beijing, China
Zefan Wang
Tsinghua University
Jiacheng Huang
Tsinghua University, Beijing, China
Haojie Wang
Tsinghua University, Beijing, China
Jianrong Wang
Tianjin University, Tianjin, China
Xu Han
Tsinghua University, Beijing, China
Zhiyuan Liu
Tsinghua University, Beijing, China
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University