🤖 AI Summary
Tensor operators account for over 90% of the computation in LLMs and deep neural networks, yet manual optimization takes months and ports poorly across heterogeneous hardware (e.g., RISC-V, ARM, GPU). Method: This paper proposes QiMeng-TensorOp, the first hardware-primitive-aware, LLM-driven framework for automatic tensor operator generation. It injects hardware semantics into the LLM generation pipeline to jointly optimize operator structure and tunable parameters. The framework integrates hardware-aware prompt engineering, template-constrained decoding, a lightweight auto-tuner, and a multi-platform performance feedback loop, enabling zero-shot cross-architecture deployment. Contribution/Results: Experiments show up to a 1,291× performance improvement in generated operators over vanilla LLMs, up to 251% of OpenBLAS performance on RISC-V CPUs and 124% of cuBLAS performance on NVIDIA GPUs, and a 200× reduction in development cost compared with human experts.
📝 Abstract
Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures such as RISC-V, ARM, and GPUs, because a manually optimized implementation takes at least months and lacks portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce QiMeng-TensorOp, a tensor-operator auto-generation framework driven by a one-line user prompt, which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives and to tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to $1291\times$ performance improvement. Even compared with human experts, QiMeng-TensorOp reaches $251\%$ of OpenBLAS performance on RISC-V CPUs and $124\%$ of cuBLAS performance on NVIDIA GPUs. QiMeng-TensorOp also reduces development costs by $200\times$ compared with human experts.
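
The abstract does not spell out how parameters are tuned, so the following is only a minimal sketch of the general idea of parameter tuning for a tensor operator, not the QiMeng-TensorOp implementation. It assumes a single tunable parameter (a square tile size) for a blocked matrix multiplication and picks the fastest candidate by direct timing. The names `blocked_matmul` and `tune_tile` are hypothetical, and pure Python stands in for the hardware primitives a real operator would use.

```python
import time


def blocked_matmul(a, b, n, tile):
    """Multiply two n x n row-major matrices (flat lists) using square tiles."""
    c = [0.0] * (n * n)
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # Work on one tile; min() handles edges when tile does not divide n.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik = a[i * n + k]
                        row_c = i * n
                        row_b = k * n
                        for j in range(j0, min(j0 + tile, n)):
                            c[row_c + j] += a_ik * b[row_b + j]
    return c


def tune_tile(n, candidates, repeats=3):
    """Time each candidate tile size and return (fastest_tile, timings)."""
    # Fixed synthetic inputs (an assumption; a real tuner would use target shapes).
    a = [float(i % 7) for i in range(n * n)]
    b = [float(i % 5) for i in range(n * n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best  # keep the best-of-repeats time per candidate
    best_tile = min(timings, key=timings.get)
    return best_tile, timings
```

Calling `tune_tile(48, [4, 8, 16, 48])` returns the tile size that ran fastest on the current machine together with the measured times. The winner depends on the hardware, which is why tuning is repeated per platform rather than fixed once.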