QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Tensor operators account for over 90% of the computation in LLMs and deep learning workloads, yet manual optimization is time-consuming and ports poorly across heterogeneous hardware (e.g., RISC-V, ARM, GPU). Method: This paper proposes the first hardware-primitive-aware, large-model-driven automatic operator generation framework. It injects hardware semantics into the LLM generation pipeline to jointly optimize operator structure and tunable parameters, integrating hardware-aware prompt engineering, template-constrained decoding, a lightweight auto-tuner, and a multi-platform performance feedback loop, enabling zero-shot cross-architecture deployment. Contribution/Results: Generated operators run up to 1,291× faster than those produced by baseline LLMs; they reach 251% of OpenBLAS performance on RISC-V CPUs and 124% of cuBLAS performance on NVIDIA GPUs; and development costs drop by 200× relative to human experts.
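As a rough illustration of the generate-tune-feedback loop named in the summary, the sketch below models one tunable parameter (a GEMM tile size). All names and the cost model are hypothetical placeholders, not the paper's implementation: `generate_candidates` stands in for LLM proposals and `benchmark` for compiling and timing a generated kernel on hardware.

```python
# Hypothetical sketch of an auto-tuning feedback loop (illustrative only).

def generate_candidates(feedback, n=4):
    """Stand-in for LLM proposals of a tunable parameter (a tile size),
    nudged toward the best value seen so far."""
    base = feedback if feedback is not None else 16
    return [max(1, base * 2 ** i) for i in range(n)]  # e.g. 16, 32, 64, 128

def benchmark(tile):
    """Stand-in for compiling and timing a kernel; synthetic cost model
    whose optimum sits at tile = 64."""
    return abs(tile - 64) + 1.0

def tune(rounds=3):
    """Keep the fastest candidate each round and feed it back as context."""
    best_tile, best_time, feedback = None, float("inf"), None
    for _ in range(rounds):
        for tile in generate_candidates(feedback):
            t = benchmark(tile)
            if t < best_time:
                best_tile, best_time = tile, t
        feedback = best_tile  # performance feedback guides the next round
    return best_tile
```

In a real system the benchmark would run on the target device, so the same loop discovers different parameters on RISC-V, ARM, and GPU backends.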

📝 Abstract
Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementations take at least months and lack portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to $1291\times$ performance improvement. Even compared with human experts, QiMeng-TensorOp reaches $251\%$ of OpenBLAS performance on RISC-V CPUs, and $124\%$ of cuBLAS performance on NVIDIA GPUs. Additionally, QiMeng-TensorOp significantly reduces development costs, by $200\times$ compared with human experts.
Problem

Research questions and friction points this paper is trying to address.

Automating high-performance tensor operator generation for diverse hardware
Overcoming LLMs' limitations in hardware-aware code optimization
Reducing development costs while improving computational efficiency significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-generates tensor operators with hardware primitives
Optimizes performance across diverse hardware platforms
Reduces development costs significantly compared with human experts
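To make "tunable parameters" concrete: a blocked matrix multiply exposes a tile size whose best value depends on the target's caches and vector width, which is exactly the kind of knob such a framework searches over. A minimal NumPy sketch (illustrative only; real generated operators would use hardware primitives such as vector intrinsics in the inner loop):

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked GEMM. The tile size is a tunable parameter: the fastest
    value differs across RISC-V, ARM, and GPU targets."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for k in range(0, K, tile):
            for j in range(0, N, tile):
                # NumPy slicing clamps at array bounds, so ragged edges are safe.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

The loop order and tile size shown here are arbitrary defaults; an auto-tuner like the one described in the summary would search over both per target platform.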
Authors
Xuzhi Zhang - Institute of Software, Chinese Academy of Sciences
Shaohui Peng - Institute of Software, Chinese Academy of Sciences (Embodied AI, Reinforcement Learning)
Qirui Zhou - Institute of Computing Technology, Chinese Academy of Sciences
Yuanbo Wen - Institute of Computing Technology, Chinese Academy of Sciences (Machine Learning System)
Qi Guo - Institute of Computing Technology, Chinese Academy of Sciences
Ruizhi Chen - Institute of Software, Chinese Academy of Sciences
Xinguo Zhu - Institute of Software, Chinese Academy of Sciences
Weiqiang Xiong - Institute of Software, Chinese Academy of Sciences
Haixin Chen - Institute of Computing Technology, Chinese Academy of Sciences
Congying Ma - Peking University
Ke Gao - Institute of Software, Chinese Academy of Sciences
Chen Zhao - Institute of Software, Chinese Academy of Sciences
Yanjun Wu - Institute of Software, Chinese Academy of Sciences (Computer Science)
Yunji Chen - Institute of Computing Technology, Chinese Academy of Sciences (processor architecture, microarchitecture, machine learning)
Ling Li - Institute of Software, Chinese Academy of Sciences