QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

📅 2025-05-08

🤖 AI Summary

Problem: Tensor operators account for over 90% of the computation in LLMs and deep neural networks, yet manual optimization takes months and ports poorly across heterogeneous hardware (e.g., RISC-V, ARM, GPU). Method: This paper proposes a hardware-primitive-aware, LLM-driven framework that generates operators automatically from a one-line user prompt. It injects hardware semantics into the LLM generation pipeline to jointly optimize operator structure and tunable parameters. The framework integrates hardware-aware prompt engineering, template-constrained decoding, a lightweight auto-tuner, and a multi-platform performance feedback loop, enabling zero-shot cross-architecture deployment. Contribution/Results: The generated operators achieve up to a 1,291× performance improvement over those produced by vanilla LLMs, reach 251% of OpenBLAS performance on RISC-V CPUs and 124% of cuBLAS performance on NVIDIA GPUs, and cut development cost by 200× compared with human experts.

📝 Abstract

Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability. LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to $1291\times$ performance improvement. Even compared with human experts, QiMeng-TensorOp could reach $251\%$ of OpenBLAS on RISC-V CPUs, and $124\%$ of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by $200\times$ compared with human experts.

Problem

Research questions and friction points this paper is trying to address.

Automating high-performance tensor operator generation for diverse hardware

Overcoming LLMs' limitations in hardware-aware code optimization

Reducing development costs while improving computational efficiency significantly

Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-generates tensor operators with hardware primitives

Optimizes performance across diverse hardware platforms

Reduces development costs by 200× compared with human experts
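
The abstract mentions tuning parameters for optimal performance on a given hardware target. As a rough illustration of what such tuning can look like, the sketch below times a tiled matrix multiplication for several candidate tile sizes and keeps the fastest. This is not the paper's method: the function names `blocked_matmul` and `autotune`, the candidate tile sizes, and the use of plain Python instead of hardware primitives are all assumptions made for illustration.

```python
import time


def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # Work on one tile; min() handles n not divisible by tile.
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(n, candidates, repeats=2):
    """Time each candidate tile size; return (best_tile, timings)."""
    # Hypothetical test inputs; a real tuner would use the target workload.
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best  # keep the fastest of the repeated runs
    best_tile = min(timings, key=timings.get)
    return best_tile, timings


if __name__ == "__main__":
    tile, timings = autotune(n=64, candidates=[4, 8, 16, 32])
    print("best tile:", tile)
    for t, seconds in sorted(timings.items()):
        print(f"tile={t:>2}: {seconds:.4f} s")
```

The selected tile size depends on the machine and will vary between runs. In pure Python the timing differences mostly reflect interpreter overhead rather than cache behavior, so the sketch conveys only the search loop (generate a candidate, measure, keep the best), not the hardware-level gains the paper reports.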
