QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

📅 2025-05-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the challenges of automated tensor program translation across heterogeneous deep learning hardware (e.g., GPUs and ASICs), including low automation, labor-intensive manual adaptation, and frequent functional errors, this paper proposes a neural-symbolic co-compilation paradigm. It leverages large language models (LLMs) to guide lightweight symbolic program synthesis for semantics-preserving code repair, integrated with a meta-prompt-driven compilation workflow and a hierarchical auto-tuning mechanism that jointly optimizes transformation sequences and execution parameters. Evaluated on four heterogeneous hardware platforms, the approach achieves an average translation accuracy of 95%, delivers up to 2.0× the performance of vendor-optimized hand-tuned libraries, and improves programming productivity by up to 96×. To the authors' knowledge, this is the first work to systematically realize "write once, run anywhere" for deep learning compilation while guaranteeing functional correctness.

📝 Abstract
Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual effort or functional incorrectness, rendering "Write Once, Run Anywhere" for tensor programs an open question. We propose a novel transcompiler, QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is to leverage the powerful code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets at a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach that systematically explores both the parameters and the sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs with an average accuracy of 95%, and the performance of translated programs reaches up to 2.0x that of vendor-provided manually optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0x via transcompiling legacy tensor programs.
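The hierarchical auto-tuning idea in the abstract (searching over sequences of transformation passes and, within each sequence, over the passes' parameters) can be sketched as a toy two-level search. All names here (the passes, their parameter spaces, and the cost model) are hypothetical stand-ins for illustration, not QiMeng-Xpiler's actual components:

```python
# Toy sketch of hierarchical auto-tuning: an outer search over pass
# orderings and an inner search over each ordering's parameters.
# Passes, parameter spaces, and the cost model are all hypothetical.
from itertools import permutations, product

# Hypothetical transformation passes, each with a tunable parameter space.
PASS_PARAMS = {
    "tile": [16, 32, 64],   # candidate tile sizes
    "vectorize": [4, 8],    # candidate vector widths
    "unroll": [2, 4],       # candidate unroll factors
}

def toy_cost(sequence, params):
    """Stand-in cost model (lower is better). A real system would compile
    the transformed program and measure it on the target hardware."""
    cost = 100.0
    for i, p in enumerate(sequence):
        # Reward larger parameter choices, weighting earlier passes more.
        cost -= PASS_PARAMS[p].index(params[p]) * (3 - i)
    return cost

def hierarchical_tune(passes):
    best = (float("inf"), None, None)
    for seq in permutations(passes):            # level 1: pass ordering
        names = list(seq)
        # level 2: parameters for this ordering
        for combo in product(*(PASS_PARAMS[p] for p in names)):
            params = dict(zip(names, combo))
            c = toy_cost(names, params)
            if c < best[0]:
                best = (c, names, params)
    return best

cost, seq, params = hierarchical_tune(["tile", "vectorize", "unroll"])
print(cost, seq, params)
```

With measured costs in place of the toy model, the same nested loop structure expresses the joint exploration of pass sequences and execution parameters that the paper describes; practical tuners would replace exhaustive enumeration with a guided search.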
Problem

Research questions and friction points this paper is trying to address.

Automating tensor program translation across diverse deep learning systems
Reducing manual effort and ensuring correctness in code transcompilation
Enhancing performance and productivity in heterogeneous DLS environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural-symbolic synthesis for tensor program transcompilation
LLM-assisted compilation passes with meta-prompts
Hierarchical auto-tuning for performance optimization
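The translate-then-repair workflow summarized above (an LLM pass proposes a translation; limited-scale symbolic synthesis repairs whatever small snippets are wrong) can be mocked in a few lines. The `mock_llm_translate` function, the constant-substitution repair search, and the test inputs are all hypothetical illustrations, not the paper's actual components:

```python
# Minimal mock of neural-symbolic translate-then-repair: a fake "LLM" pass
# produces a mostly correct translation with one small error, equivalence
# is checked on test inputs, and a tiny symbolic search repairs the error.

def mock_llm_translate(src):
    """Pretend LLM pass: returns a translation with a deliberately
    wrong constant that the symbolic repair step must fix."""
    return src.replace("* 2", "* 3")  # deliberately introduced error

def is_equivalent(candidate, reference, tests):
    """Check semantic equivalence of two one-argument expressions
    on a set of test inputs."""
    f = eval("lambda a: " + candidate)
    g = eval("lambda a: " + reference)
    return all(f(t) == g(t) for t in tests)

def symbolic_repair(candidate, reference, tests):
    """Enumerate small edits (here: constant substitutions) until the
    candidate matches the reference on all tests. The search stays
    tractable because the LLM got most of the code right."""
    for const in range(10):
        patched = candidate.replace("* 3", f"* {const}")
        if is_equivalent(patched, reference, tests):
            return patched
    return None

source = "a * 2"
draft = mock_llm_translate(source)   # "a * 3", incorrect
repaired = symbolic_repair(draft, source, tests=[0, 1, 5])
print(repaired)  # "a * 2"
```

The key property this sketch illustrates is the division of labor: the neural component narrows the search to a near-correct program, so the symbolic component only has to explore a small repair space.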
Shouyang Dong
University of Science and Technology of China, Cambricon Technologies
Yuanbo Wen
Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning System
Jun Bi
SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Di Huang
SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Jiaming Guo
Institute of Computing Technology, Chinese Academy of Sciences
Artificial Intelligence, Reinforcement Learning
Jianxing Xu
University of Science and Technology of China, SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Ruibai Xu
University of Science and Technology of China, SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Xinkai Song
Institute of Computing Technology, Chinese Academy of Sciences
Reinforcement Learning, Gaming, AlphaGo, Neural Network Quantization, Integrated Circuit Design
Yifan Hao
SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Xuehai Zhou
University of Science and Technology of China
Tianshi Chen
Cambricon Technologies
Qi Guo
SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
Yunji Chen
Institute of Computing Technology, Chinese Academy of Sciences
Processor Architecture, Microarchitecture, Machine Learning