Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of the trade-off between functional correctness and energy efficiency in code language models (CLMs). The authors propose BRACE, an evaluation framework that jointly incorporates functional correctness testing and empirically measured energy consumption. BRACE introduces two complementary metrics: CIRC, a static composite score based on Euclidean distance in the correctness-efficiency space, and OTER, a dynamic weighted ranking method that is trend-aware and adapts to task-specific energy-accuracy trade-offs, moving beyond single-objective, performance-centric evaluation. Evaluating 22 mainstream CLMs, the authors find that model size does not determine overall ranking (smaller models often score higher through more efficient parameter utilization) and that code summarization consistently outperforms code generation in both accuracy and energy efficiency. BRACE establishes the first multidimensional benchmark for CLM selection that explicitly balances functional correctness with energy efficiency, enabling sustainable and precise model deployment.

📝 Abstract
The rapid advancement of AI technologies and their accelerated adoption in software development necessitate a systematic evaluation of their environmental impact alongside functional correctness. While prior studies have examined sustainability in large language models, existing approaches lack systematic frameworks for evaluating accuracy-energy trade-offs in Code Language Models (CLMs). In this paper, we present BRACE, a framework to benchmark CLMs on a unified scale of energy efficiency and functional correctness (referred to as accuracy). We benchmark 22 state-of-the-art models on code generation and summarization tasks, proposing two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER). CIRC provides deterministic Euclidean-based rankings with static trade-offs that are robust to outliers, while OTER offers trend-aware evaluation with dynamic trade-offs that capture the complex correlation between energy and accuracy; each offers a distinct perspective and addresses the problem in a unique way. These rating methods enable us to rate LLMs on a 1-5 scale reflecting their combined energy efficiency and functional correctness. Our analysis reveals that models generally perform better on code summarization tasks, since they are not required to produce grammatically constrained, syntactically correct output. We also find that model size does not have a significant impact on ratings: models that utilize their parameters efficiently can rank highly regardless of scale. The proposed BRACE framework empowers practitioners to make evidence-based model selections that balance sustainability with task requirements, guiding rating choice -- CIRC for deterministic comparisons or OTER for trend-aware evaluation -- based on deployment priorities.
Problem

Research questions and friction points this paper is trying to address.

Evaluating accuracy-energy trade-offs in Code Language Models systematically
Benchmarking 22 models on code generation and summarization tasks
Developing rating methods to balance sustainability with functional correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

BRACE framework benchmarks Code Language Models
CIRC method provides deterministic accuracy-energy rankings
OTER method enables trend-aware evaluation with dynamic trade-offs
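To make the CIRC idea concrete, the sketch below computes a composite score as the Euclidean distance from an ideal point (perfect accuracy, minimal energy) in the normalized correctness-efficiency plane, then maps that distance to a 1-5 rating via concentric bands. The min-max energy normalization, the equal-width band edges, and the function names are illustrative assumptions; the paper's exact formulation may differ.

```python
import math

def circ_score(accuracy, energy, energy_min, energy_max):
    """Euclidean distance from the ideal point (accuracy = 1,
    normalized energy = 0). Lower distance is better.
    Min-max energy normalization is an assumption for illustration."""
    e_norm = (energy - energy_min) / (energy_max - energy_min)
    return math.hypot(1.0 - accuracy, e_norm)

def circ_rating(distance):
    """Map a distance to a 1-5 rating via concentric rings around
    the ideal point. Band edges here are illustrative: the maximum
    possible distance sqrt(2) is split into five equal-width rings."""
    band = math.sqrt(2) / 5
    return 5 - min(int(distance // band), 4)

# Example: a model with 90% accuracy near the low end of the
# observed energy range falls in the innermost ring.
d = circ_score(accuracy=0.9, energy=12.0, energy_min=10.0, energy_max=50.0)
print(circ_rating(d))
```

Because the distance is deterministic once the energy range is fixed, rankings under this scheme are stable across reruns, which matches the paper's description of CIRC as the static, outlier-robust option.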