🤖 AI Summary
This work investigates the feasibility of the “Large Language Model as a Compiler” (LaaC) paradigm—i.e., whether LLMs can perform precise, end-to-end compilation from source code to target assembly. To this end, the authors introduce CompilerEval, the first benchmark dataset specifically designed for compilation tasks, comprising multilingual source code paired with cross-platform (x86, ARM, RISC-V) assembly. They develop a dedicated evaluation framework and employ prompt engineering, chain-of-thought reasoning, and model scaling to systematically assess the source-code understanding and assembly-generation capabilities of leading open- and closed-source LLMs. Results show that current LLMs possess foundational compilation ability, and that targeted optimizations substantially improve assembly correctness and overall compilation success rates. This study provides the first systematic empirical validation of LaaC’s technical viability, proposes principled architectural guidelines and evolutionary pathways for compilation-oriented LLMs, and establishes a foundation for AI-native compiler research.
📝 Abstract
In recent years, end-to-end Large Language Model (LLM) technology has shown substantial advantages across various domains. As critical system software and infrastructure, compilers are responsible for transforming source code into target code. While LLMs have been leveraged to assist in compiler development and maintenance, their potential as end-to-end compilers remains largely unexplored. This paper explores the feasibility of LLM as a Compiler (LaaC) and its future directions. We designed the CompilerEval dataset and framework specifically to evaluate the capabilities of mainstream LLMs in source code comprehension and assembly code generation. In the evaluation, we analyzed various errors, explored multiple methods for improving LLM-generated code, and assessed cross-platform compilation capabilities. Experimental results demonstrate that LLMs exhibit basic capabilities as compilers but currently achieve low compilation success rates. By optimizing prompts, scaling up model size, and incorporating reasoning methods, the quality of assembly code generated by LLMs can be significantly enhanced. Based on these findings, we maintain an optimistic outlook for LaaC and propose practical architectural designs and future research directions. We believe that with targeted training, knowledge-rich prompts, and specialized infrastructure, LaaC has the potential to generate high-quality assembly code and drive a paradigm shift in the field of compilation.
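The paper's evaluation framework is not reproduced here, but a minimal sketch illustrates the general idea of scoring an LLM used as a compiler: extract the assembly from the model's reply, hand it to the system assembler as a syntactic check, and aggregate a compilation success rate over the benchmark. The function names (`extract_assembly`, `assembles`, `success_rate`) and the use of `gcc -c` as the checker are illustrative assumptions, not CompilerEval's actual interface.

```python
import re
import subprocess
import tempfile


def extract_assembly(llm_response: str) -> str:
    """Pull assembly text out of an LLM reply, stripping any markdown
    code fence the model may have wrapped around it (hypothetical
    post-processing step)."""
    m = re.search(r"```(?:\w+)?\n(.*?)```", llm_response, re.DOTALL)
    return (m.group(1) if m else llm_response).strip()


def assembles(asm_text: str, cc: str = "gcc") -> bool:
    """Check syntactic validity by asking the system toolchain to
    assemble the text; True when an object file is produced.
    Using gcc as the driver is an assumption for this sketch."""
    with tempfile.NamedTemporaryFile("w", suffix=".s") as f:
        f.write(asm_text + "\n")
        f.flush()
        result = subprocess.run(
            [cc, "-c", f.name, "-o", "/dev/null"],
            capture_output=True,
        )
    return result.returncode == 0


def success_rate(outcomes: list[bool]) -> float:
    """Compilation success rate across all benchmark cases."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A fuller harness would also link and run the object file against reference outputs to test semantic correctness, not just whether the assembly parses; the sketch above covers only the syntactic gate.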