SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can serve as general-purpose surrogate code executors, accurately predicting program outputs and behaviors without actual execution. Method: the authors introduce SURGE, a comprehensive benchmark spanning eight challenging domains: multi-language programming, competition-level algorithm problems, repository-scale code analysis, high-cost scientific computation, time-complexity-intensive tasks, buggy-code diagnosis, compiler- or environment-dependent programs, and formal mathematical proof verification. They conduct the first systematic evaluation of LLMs' feasibility as surrogate executors across cross-lingual, high-cost, environment-sensitive, and formal-verification dimensions, propose an error attribution taxonomy, and run scaling studies across open- and closed-source models of varying sizes, integrating program-behavior modeling with execution-trace prediction. Contribution/Results: LLMs exhibit nascent surrogate-execution capability on certain tasks, but generalization remains limited, and model scale and training-data volume yield nonlinear performance gains.

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, to predict the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. Code and dataset are released at https://github.com/Imbernoulli/SURGE.
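To make the evaluation setup concrete, here is a minimal sketch of the surrogate-execution idea: compare a model's predicted output for a program against the ground truth obtained by actually running it. The `predict_output` function below is a hypothetical stub standing in for a real LLM call; it is not part of the SURGE harness.

```python
import subprocess
import sys

def actual_output(source: str) -> str:
    """Ground truth: actually run the program and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=10,
    )
    return result.stdout.strip()

def predict_output(source: str) -> str:
    """Stand-in for an LLM asked to predict the program's output
    without running it. A real harness would prompt a model with the
    source code; here we hard-code guesses for the demo programs."""
    return {"print(sum(range(5)))": "10"}.get(source, "")

def surrogate_accuracy(programs: list[str]) -> float:
    """Fraction of programs whose predicted output matches real execution."""
    hits = sum(predict_output(p) == actual_output(p) for p in programs)
    return hits / len(programs)

programs = ["print(sum(range(5)))", "print(2 ** 10)"]
print(surrogate_accuracy(programs))  # stub predicts only the first correctly → 0.5
```

In the actual benchmark, the same comparison is made over eight task domains, some of which (e.g. high-cost scientific computing) make real execution expensive and surrogate prediction correspondingly attractive.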
Problem

Research questions and friction points this paper is trying to address.

Assess LLMs as surrogate code executors
Evaluate LLMs on multi-language programming tasks
Analyze LLMs' limitations in general-purpose execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs as surrogate executors
Comprehensive benchmark SURGE
Error analysis for improvement