TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

📅 2025-02-20
🤖 AI Summary
Current large language models (LLMs) frequently produce Triton GPU kernels with functional errors and suboptimal performance, primarily due to insufficient understanding of Triton programming conventions and GPU hardware characteristics, compounded by the absence of a dedicated evaluation benchmark. To address this, we introduce TritonBench, the first comprehensive benchmark for Triton kernel generation, comprising 184 real-world operators curated from GitHub together with a second channel of operators aligned with PyTorch interfaces. TritonBench integrates empirical GPU performance metrics (kernel latency and throughput) into code generation evaluation, proposing a dual-axis assessment: functional correctness plus hardware-aware performance. It additionally provides PyTorch interface specifications, CUDA configuration profiling, an open-source kernel dataset, and an automated verification pipeline. Experiments reveal that state-of-the-art LLMs achieve an average performance compliance rate below 12%, exposing a critical bottleneck in LLM-driven high-performance operator synthesis. TritonBench establishes a reproducible, extensible evaluation infrastructure for this task.
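The dual-axis assessment described above can be sketched as a small scoring routine: a generated kernel only counts toward the compliance rate if it both runs correctly against the reference implementation and meets a latency target. This is a hypothetical illustration of the idea, not TritonBench's actual scoring code; the field names, the `speedup_threshold` knob, and the pass criterion are assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class KernelResult:
    compiles: bool           # generated Triton code executes without error
    matches_reference: bool  # outputs agree with the reference implementation
    latency_ms: float        # measured latency of the generated kernel
    ref_latency_ms: float    # latency of the reference (e.g. PyTorch) kernel

def performance_compliant(r: KernelResult, speedup_threshold: float = 1.0) -> bool:
    """Dual-axis check: functional correctness AND hardware performance.

    `speedup_threshold` is a hypothetical knob: 1.0 requires the generated
    kernel to be at least as fast as the reference.
    """
    if not (r.compiles and r.matches_reference):
        return False  # fails the correctness axis outright
    return r.ref_latency_ms / r.latency_ms >= speedup_threshold

def compliance_rate(results: list[KernelResult]) -> float:
    """Fraction of benchmark operators passing both axes."""
    if not results:
        return 0.0
    return sum(performance_compliant(r) for r in results) / len(results)

# Toy run: one fast-and-correct kernel, one correct-but-slow, one broken.
results = [
    KernelResult(True, True, latency_ms=0.8, ref_latency_ms=1.0),
    KernelResult(True, True, latency_ms=2.0, ref_latency_ms=1.0),
    KernelResult(False, False, latency_ms=1.0, ref_latency_ms=1.0),
]
print(compliance_rate(results))  # only the first kernel passes both axes
```

Under this kind of metric, a model can score well on plain correctness yet poorly on compliance, which is consistent with the sub-12% compliance rates the paper reports.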

📝 Abstract
Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation. TritonBench will be available at https://github.com/thunlp/TritonBench.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs for Triton code generation
Evaluating efficiency of GPU-optimized Triton operators
Addressing gaps in high-performance GPU code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive benchmark for LLM-generated Triton operators
Profiles runtime efficiency on widely deployed GPUs, not just correctness
Curates 184 real-world GitHub operators plus PyTorch-aligned operators
Jianling Li
Tianjin University, Tianjin, China
Shangzhan Li
Harbin Institute of Technology, Harbin, China
Zhenye Gao
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Qi Shi
Tsinghua University, Beijing, China
Yuxuan Li
Tsinghua University, Beijing, China
Zefan Wang
Tsinghua University
Jiacheng Huang
Tsinghua University, Beijing, China
Haojie Wang
Tsinghua University, Beijing, China
Jianrong Wang
Tianjin University, Tianjin, China
Xu Han
Tsinghua University, Beijing, China
Zhiyuan Liu
Tsinghua University, Beijing, China
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University