EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

๐Ÿ“… 2025-05-19
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Existing code generation benchmarks overemphasize functional correctness while neglecting runtime efficiency, and they are largely confined to single-language evaluation. Method: EffiBench-X introduces the first multilingual efficiency benchmark covering Python, C++, Java, JavaScript, Ruby, and Go, built on competitive programming tasks. It features an automated performance evaluation framework integrating time/space measurement, cross-language compilation and execution, and standardized scoring. Contribution/Results: Its core innovation is establishing human-expert solutions as the efficiency gold standard, a paradigm that enables systematic quantification of LLMs' cross-language efficiency disparities. Empirical evaluation reveals that even the best-performing model, Qwen3-32B, achieves only about 62% of human-level efficiency on average. Notably, LLMs attain markedly higher relative efficiency in dynamically typed languages (Python, JavaScript, Ruby) than in statically typed ones (Java, C++, Go), and DeepSeek-R1 exhibits a pronounced inter-language efficiency gap.
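The "human expert solutions as the efficiency gold standard" idea can be sketched as a normalization step: measure the LLM solution's runtime, then score it relative to the human baseline on the same task. This is a minimal illustrative sketch; the function names and the exact scoring formula are assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of EffiBench-X-style scoring: normalize an LLM
# solution's measured runtime against the human-expert baseline.
# A score of 1.0 means human-level efficiency; ~0.62 matches the
# paper's reported average for Qwen3-32B.

def efficiency_score(llm_runtime_s: float, human_runtime_s: float) -> float:
    """Ratio of human runtime to LLM runtime; > 1.0 means the LLM is faster."""
    if llm_runtime_s <= 0 or human_runtime_s <= 0:
        raise ValueError("runtimes must be positive")
    return human_runtime_s / llm_runtime_s

def average_efficiency(pairs) -> float:
    """Mean score over (llm_runtime_s, human_runtime_s) pairs, e.g. per language."""
    scores = [efficiency_score(llm, human) for llm, human in pairs]
    return sum(scores) / len(scores)

# An LLM solution taking 1.6 s where the human expert takes 1.0 s scores 0.625.
print(round(efficiency_score(1.6, 1.0), 3))
```

The same normalization can be applied to peak memory instead of runtime, and averaging per language makes cross-language comparisons like "Python vs. Java" directly readable.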

๐Ÿ“ Abstract
Existing code generation benchmarks primarily evaluate functional correctness, with limited focus on code efficiency and often restricted to a single language like Python. To address this gap, we introduce EffiBench-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations. LLMs show better efficiency in Python, Ruby, and JavaScript than in Java, C++, and Golang. For instance, DeepSeek-R1's Python code is significantly more efficient than its Java code. These results highlight the critical need for research into LLM optimization techniques to improve code efficiency across diverse languages. The dataset and evaluation infrastructure are submitted and available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.
Problem

Research questions and friction points this paper is trying to address.

Measures efficiency of LLM-generated code across multiple languages
Compares LLM code efficiency to human-expert baselines
Reveals significant efficiency gaps between LLMs and humans
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-language benchmark for code efficiency
Human-expert solutions as efficiency baselines
Evaluates LLMs across Python, C++, Java, JavaScript, Ruby, and Golang