🤖 AI Summary
Existing deep learning (DL) operator generation benchmarks suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we propose the first large language model (LLM)-oriented, multi-platform DL operator generation benchmark, encompassing 285 tasks across 14 fine-grained categories and supporting NVIDIA GPUs, Huawei NPUs, and Google TPUs. We introduce a modular backend abstraction layer to ensure hardware extensibility and a category-aware one-shot prompting strategy to enhance generation quality. Comprehensive evaluation of seven mainstream LLMs reveals their sensitivity to task difficulty and critical bottlenecks in cross-platform generalization. Empirical results demonstrate that targeted prompting significantly improves kernel generation correctness. The benchmark is publicly released, establishing a standardized evaluation infrastructure for automated kernel generation research.
📝 Abstract
The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: NVIDIA GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing easy integration of new hardware platforms. We further propose a simple yet effective category-aware one-shot prompting method that improves generation quality by providing in-category exemplars. Through systematic evaluations of seven state-of-the-art LLMs, we reveal significant variation in task difficulty, poor generalization to platforms underrepresented in training data, and the effectiveness of targeted prompting strategies. MultiKernelBench is publicly available at https://github.com/wzzll123/MultiKernelBench.