🤖 AI Summary
Existing code generation models lack systematic evaluation on efficiency-critical tasks. Method: We introduce EffiBench—the first benchmark dedicated to code efficiency—comprising 1,000 LeetCode problems sensitive to time/space complexity, each paired with a state-of-the-art human implementation. We integrate green computing principles into code generation evaluation by proposing quantifiable efficiency metrics and conducting standardized, cross-model assessment in real execution environments with automated performance measurement (CPU time and memory usage). Contribution/Results: Experiments reveal that GPT-4–generated code takes, on average, 3.12× the execution time of the human-optimal solutions (up to 13.89× in the worst case) and consumes up to 43.92× the memory. All evaluated models significantly underperform humans, and open-source models consistently lag behind closed-source counterparts in efficiency.
📝 Abstract
Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, its efficiency—a vital aspect for green computing and sustainability—has often been neglected. This paper presents EffiBench, a benchmark of 1,000 efficiency-critical coding problems for assessing the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution that achieves SOTA efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of LLM-generated code is generally worse than that of the human-written canonical solutions. For example, GPT-4-generated code has an average execution time **3.12** times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4-generated code are **13.89** and **43.92** times those of the canonical solutions. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide the leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
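The kind of comparison EffiBench performs—execution time and memory of a generated solution relative to a canonical one—can be sketched with the Python standard library. This is an illustrative toy, not the paper's actual harness: the `profile` helper, the problem (sum of the first n squares), and both solutions below are hypothetical examples.

```python
import time
import tracemalloc

def profile(fn, *args):
    """Return (wall-clock seconds, peak traced memory in bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

# Hypothetical "canonical" solution: closed-form, O(1) time and memory.
def canonical(n):
    return n * (n + 1) * (2 * n + 1) // 6

# Hypothetical "generated" solution: correct, but builds an O(n) list.
def generated(n):
    return sum([i * i for i in range(1, n + 1)])

t_h, m_h = profile(canonical, 10**6)
t_g, m_g = profile(generated, 10**6)
print(f"time ratio:   {t_g / t_h:.1f}x")
print(f"memory ratio: {m_g / m_h:.1f}x")
```

Both functions return the same answer, so a correctness-only benchmark would score them identically; the time and memory ratios are what a benchmark like EffiBench surfaces.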