EffiBench: Benchmarking the Efficiency of Automatically Generated Code

📅 2024-02-03
🏛️ Neural Information Processing Systems
📈 Citations: 28
Influential: 2
🤖 AI Summary
Existing code generation models lack systematic evaluation on efficiency-critical tasks. Method: We introduce EffiBench, the first benchmark dedicated to code efficiency, comprising 1,000 LeetCode problems sensitive to time/space complexity, each paired with a state-of-the-art human-written solution. The work integrates green computing concerns into code generation evaluation by running generated code in real execution environments with automated performance measurement (CPU time and memory usage) under a standardized, cross-model protocol. Contribution/Results: Experiments reveal that GPT-4-generated code takes, on average, 3.12× the execution time of the human canonical solutions (up to 13.89× in the worst case) and consumes up to 43.92× their memory. All evaluated models significantly underperform humans; moreover, open-source models consistently lag behind closed-source counterparts in efficiency.
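The measurement idea, evaluating a generated solution and a canonical solution on the same input and comparing CPU time and peak memory, can be sketched as below. This is a minimal illustration, not EffiBench's actual harness; `generated_sum` and `canonical_sum` are hypothetical stand-ins for a model-generated and a human-optimal solution.

```python
import time
import tracemalloc

def profile(fn, *args):
    """Measure wall-clock time (seconds) and peak traced memory (bytes) of one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Hypothetical stand-ins: sum of integers 0..n-1.
def generated_sum(n):
    # Naive O(n) loop, playing the role of a model-generated solution.
    total = 0
    for i in range(n):
        total += i
    return total

def canonical_sum(n):
    # Closed-form O(1), playing the role of the human-optimal canonical solution.
    return n * (n - 1) // 2

r_gen, t_gen, m_gen = profile(generated_sum, 100_000)
r_can, t_can, m_can = profile(canonical_sum, 100_000)
assert r_gen == r_can  # correctness first, then compare efficiency
print(f"execution-time ratio: {t_gen / t_can:.1f}x")
```

The same ratio-to-canonical reporting is how the headline numbers (e.g. 3.12× average execution time for GPT-4) are expressed.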

📝 Abstract
Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, a vital aspect, code efficiency, which plays a pivotal role in green computing and sustainability efforts, has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems to assess the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution, which obtains the SOTA efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of the code generated by LLMs is generally worse than the efficiency of human-written canonical solutions. For example, GPT-4-generated code has an average execution time 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4-generated code are 13.89 and 43.92 times those of the canonical solutions. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide a leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
Problem

Research questions and friction points this paper is trying to address.

Assessing the efficiency, not just the correctness, of code produced by generation models
Comparing LLM-generated code against efficiency-optimal human solutions
Benchmarking 42 LLMs on 1,000 efficiency-critical problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

EffiBench, a benchmark targeting code generation efficiency
1,000 efficiency-critical LeetCode problems, each with a SOTA human canonical solution
Standardized execution-time and memory measurement across 42 LLMs