🤖 AI Summary
This study systematically evaluates the effectiveness and hidden trade-offs of large language models (LLMs) in green optimization—i.e., energy and resource efficiency—of MATLAB scientific code. Method: Using 400 MATLAB scripts drawn from 100 popular real-world GitHub repositories, we benchmark GPT-3, GPT-4, Llama, and Mixtral against a senior developer's suggestions (2,176 recommendations in total), mapping them to an energy-aware taxonomy of 13 optimization themes. We conduct a multidimensional empirical evaluation across energy consumption, memory usage, execution time, and functional correctness, complemented by statistical testing and qualitative root-cause analysis. Contribution/Results: LLM-suggested optimizations do not significantly reduce energy consumption or execution time and, on average, increase memory footprint. However, LLMs propose a broader spectrum of improvements than the human baseline, including code readability and error handling, which the developer tended to overlook. Critically, we identify "pseudo-green" practices—e.g., trading higher memory for lower CPU utilization—and advocate for standardized green-coding evaluation metrics to guide sustainable AI-assisted development.
📝 Abstract
Rapid technological evolution has accelerated software development across domains and use cases, contributing to a growing share of global carbon emissions. While recent large language models (LLMs) claim to assist developers in optimizing code for performance and energy efficiency, their efficacy in real-world scenarios remains underexplored. In this work, we explore the effectiveness of LLMs in reducing the environmental footprint of real-world projects, focusing on software written in MATLAB, which is widely used in both academia and industry for scientific and engineering applications. We analyze energy-focused optimization on 400 scripts across 100 top GitHub repositories. We examine 2,176 potential optimizations recommended by leading LLMs (GPT-3, GPT-4, Llama, and Mixtral) and a senior MATLAB developer, assessing their impact on energy consumption, memory usage, execution time, and code correctness. The developer serves as a real-world baseline for comparing typical human and LLM-generated optimizations. Mapping these optimizations to 13 high-level themes, we found that LLMs propose a broad spectrum of improvements beyond energy efficiency, including better code readability and maintainability, memory management, and error handling, whereas the developer overlooked areas such as parallel processing and error handling. However, our statistical tests reveal that the energy-focused optimizations unexpectedly degraded memory usage, with no clear benefits in execution time or energy consumption. Our qualitative analysis of energy-time trade-offs revealed that certain themes, such as vectorization and preallocation, were among the most common drivers of these trade-offs. With LLMs becoming ubiquitous in modern software development, our study serves as a call to action: prioritize the evaluation of common coding practices to identify the truly green ones.
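The vectorization and preallocation trade-offs mentioned above can be illustrated with a minimal MATLAB sketch (our own illustrative example, not code drawn from the study's benchmark set): the vectorized form is typically fastest, but it materializes the full result at once, exemplifying the memory-for-time trade the study flags as potentially "pseudo-green".

```matlab
n = 1e6;
x = rand(1, n);

% Naive loop: y grows on every iteration, forcing repeated
% reallocation and copying (slower, more transient memory traffic).
y = [];
for i = 1:n
    y(end+1) = x(i)^2;   % incremental growth
end

% Preallocated loop: one allocation up front, then in-place writes.
y2 = zeros(1, n);
for i = 1:n
    y2(i) = x(i)^2;
end

% Vectorized form: shortest runtime, but allocates the entire
% result array in one step.
y3 = x.^2;
```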