MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

📅 2025-07-19
🤖 AI Summary
Existing deep learning (DL) operator generation benchmarks suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we propose the first large language model (LLM)-oriented, multi-platform DL operator generation benchmark, encompassing 285 tasks across 14 fine-grained categories and supporting NVIDIA GPUs, Huawei NPUs, and Google TPUs. We introduce a modular backend abstraction layer to ensure hardware extensibility and a category-aware one-shot prompting strategy to enhance generation quality. Comprehensive evaluation of seven mainstream LLMs reveals their sensitivity to task difficulty and critical bottlenecks in cross-platform generalization. Empirical results demonstrate that targeted prompting significantly improves kernel generation correctness. The benchmark is publicly released, establishing a standardized evaluation infrastructure for automated kernel generation research.

📝 Abstract
The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: NVIDIA GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing easy integration of new hardware platforms. We further propose a simple yet effective category-aware one-shot prompting method that improves generation quality by providing in-category exemplars. Through systematic evaluations of seven state-of-the-art LLMs, we reveal significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies. MultiKernelBench is publicly available at https://github.com/wzzll123/MultiKernelBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs for DL kernel generation across multiple hardware platforms
Address limited hardware support and imbalanced task coverage in benchmarks
Improve kernel generation quality with category-aware prompting methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-platform benchmark for kernel generation
Modular backend abstraction for extensibility
Category-aware one-shot prompting method
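The category-aware one-shot prompting idea can be sketched as follows: instead of a single fixed exemplar, the prompt for each task includes a worked example drawn from the same kernel category. The exemplar table and the prompt template below are hypothetical, invented for illustration; the paper's actual prompts and categories will differ.

```python
# Hypothetical exemplar table: category -> (task description, reference kernel).
# Entries are illustrative, not taken from MultiKernelBench.
EXEMPLARS = {
    "reduction": ("Sum over the last axis", "def kernel(x): return x.sum(-1)"),
    "elementwise": ("Elementwise ReLU", "def kernel(x): return x.clamp(min=0)"),
}

def build_prompt(task_desc: str, category: str) -> str:
    # Pick the in-category exemplar so the model sees a structurally
    # similar kernel before attempting the target task.
    ex_desc, ex_code = EXEMPLARS[category]
    return (
        "Write a kernel for the task below.\n\n"
        f"Example ({category}):\nTask: {ex_desc}\nKernel:\n{ex_code}\n\n"
        f"Task: {task_desc}\nKernel:\n"
    )

print(build_prompt("Max over the last axis", "reduction"))
```

The design choice this sketch captures is that a same-category exemplar gives the model a closer structural template than a generic one, which is the mechanism the paper credits for improved generation correctness.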
👥 Authors
Zhongzhen Wen (State Key Lab for Novel Software Technology, Nanjing University)
Yinghui Zhang (XUPT & SMU; topics: Public Key Cryptography, Cloud Security, Network Security)
Zhong Li (State Key Lab for Novel Software Technology, Nanjing University)
Zhongxin Liu (Zhejiang University; topics: Software Engineering, Large Language Models)
Linna Xie (State Key Lab for Novel Software Technology, Nanjing University)
Tian Zhang (State Key Lab for Novel Software Technology, Nanjing University)