EffiBench: Benchmarking the Efficiency of Automatically Generated Code

📅 2024-02-03
🏛️ Neural Information Processing Systems
📈 Citations: 28
Influential: 2
🤖 AI Summary
Existing code generation models lack systematic evaluation on efficiency-critical tasks. Method: We introduce EffiBench, the first benchmark dedicated to code efficiency, comprising 1,000 LeetCode problems sensitive to time/space complexity, each paired with a state-of-the-art human-written solution. The work integrates green computing concerns into code generation evaluation by running generated code in real execution environments with automated performance measurement (CPU time and memory usage) under a standardized, cross-model protocol. Contribution/Results: Experiments reveal that GPT-4-generated code takes, on average, 3.12× the execution time of the human canonical solutions (up to 13.89× in the worst case) and consumes up to 43.92× their memory. All evaluated models significantly underperform humans; moreover, open-source models consistently lag behind closed-source counterparts in efficiency.
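The measurement idea, evaluating a generated solution and a canonical solution on the same input and comparing CPU time and peak memory, can be sketched as below. This is a minimal illustration, not EffiBench's actual harness; `generated_sum` and `canonical_sum` are hypothetical stand-ins for a model-generated and a human-optimal solution.

```python
import time
import tracemalloc

def profile(fn, *args):
    """Measure wall-clock time (seconds) and peak traced memory (bytes) of one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Hypothetical stand-ins: sum of integers 0..n-1.
def generated_sum(n):
    # Naive O(n) loop, playing the role of a model-generated solution.
    total = 0
    for i in range(n):
        total += i
    return total

def canonical_sum(n):
    # Closed-form O(1), playing the role of the human-optimal canonical solution.
    return n * (n - 1) // 2

r_gen, t_gen, m_gen = profile(generated_sum, 100_000)
r_can, t_can, m_can = profile(canonical_sum, 100_000)
assert r_gen == r_can  # correctness first, then compare efficiency
print(f"execution-time ratio: {t_gen / t_can:.1f}x")
```

The same ratio-to-canonical reporting is how the headline numbers (e.g. 3.12× average execution time for GPT-4) are expressed.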

📝 Abstract
Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, a vital aspect, code efficiency, which plays a pivotal role in green computing and sustainability efforts, has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems to assess the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution, which obtains the SOTA efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of the code generated by LLMs is generally worse than the efficiency of human-written canonical solutions. For example, GPT-4-generated code has an average execution time 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4-generated code are 13.89 and 43.92 times those of the canonical solutions. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide a leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
Problem

Research questions and friction points this paper is trying to address.

Assessing the efficiency, not just the correctness, of code produced by generation models
Comparing LLM-generated code against efficiency-optimal human solutions
Benchmarking 42 LLMs on 1,000 efficiency-critical problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

EffiBench, a benchmark targeting code generation efficiency
1,000 efficiency-critical LeetCode problems, each with a SOTA human canonical solution
Standardized execution-time and memory measurement across 42 LLMs