🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models’ (LLMs) code generation with respect to performance and energy efficiency. We conduct the first empirical, cross-model (GitHub Copilot, GPT-4o, o1-mini), cross-language (Python, Java, C++), and cross-platform (Mac via powermetrics; Windows/Linux PC via RAPL) assessment using high-difficulty LeetCode problems. Our unified evaluation framework integrates benchmarking, fine-grained power measurement, and code quality analysis. Results show that LLMs achieve significantly higher functional correctness in Python and Java than in C++; for functionally equivalent tasks, their generated code consumes 37%–219% more energy than human-written implementations and exhibits substantial runtime variance. Crucially, we uncover previously unreported coupled effects of programming language, model architecture, and hardware platform on energy efficiency—thereby advancing beyond traditional correctness-only evaluation paradigms.
📝 Abstract
Large language models (LLMs) are used in software development to assist with various tasks, e.g., code generation and code completion, but empirical evaluations of the quality of the results produced by these models focus on correctness and ignore other relevant aspects, such as performance and energy efficiency. Studying the performance of LLM-produced programs is essential to understand how well LLMs can support the construction of performance- and energy-critical software, such as operating systems, servers, and mobile applications. This paper presents the first study analyzing the energy efficiency and performance of LLM-generated code for three programming languages (Python, Java, and C++), on two platforms (a Mac and a PC), leveraging three frontier LLMs (GitHub Copilot, GPT-4o, and the recently released OpenAI o1-mini), and targeting "hard" programming problems from LeetCode. Our results show that the models are much more successful in generating Python and Java than C++ code.
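On Linux, the RAPL measurements mentioned above are typically taken from the powercap sysfs interface, which exposes a cumulative package-energy counter in microjoules. The sketch below illustrates the general before/after counter-reading pattern such a harness might use; it is a minimal illustration under assumed sysfs paths, not the paper's actual measurement framework, and the `measure` helper and wrap-around constant are my own assumptions.

```python
# Minimal sketch of RAPL-style energy measurement on Linux via the
# powercap sysfs interface. The paths and helper names here are
# illustrative assumptions, not the study's actual tooling.
from pathlib import Path
import time

# Package-domain energy counter (microjoules), present on most Intel CPUs.
RAPL_ENERGY_FILE = Path("/sys/class/powercap/intel-rapl:0/energy_uj")


def read_energy_uj(path=RAPL_ENERGY_FILE):
    """Read the cumulative package energy counter, in microjoules."""
    return int(path.read_text())


def measure(func, *, read=read_energy_uj, max_range_uj=2**32):
    """Run `func` once and return (result, joules, seconds).

    The RAPL counter wraps around at `max_range_uj` (assumed here);
    correcting for a single wrap suffices for short benchmark runs.
    """
    e0, t0 = read(), time.perf_counter()
    result = func()
    e1, t1 = read(), time.perf_counter()
    delta_uj = e1 - e0
    if delta_uj < 0:  # counter wrapped during the run
        delta_uj += max_range_uj
    return result, delta_uj / 1e6, t1 - t0
```

Reading the counter immediately before and after the workload, rather than sampling power over time, keeps the harness simple, but it attributes all energy drawn by the package during the run to the measured program, so repeated runs on a quiet machine are needed for stable numbers.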