The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation methods, both automated benchmarks and conventional human evaluations, suffer from scalability limitations and a lack of energy awareness. To address this, we propose GEA, the first publicly available human evaluation platform for LLMs that integrates real-time energy consumption metrics. Built on an arena-style evaluation architecture, GEA enables side-by-side comparison and ranking of responses from two models while transparently displaying the inference energy used by each response. Through this energy-informed evaluation mechanism, we conduct the first empirical study of how energy-efficiency information influences human model preferences. Results show that when energy data is disclosed, users significantly favor smaller, more energy-efficient models; in most cases, the marginal performance gains of larger models fail to justify their substantially higher energy costs. This work establishes the paradigm of "energy-aware human evaluation," providing both a methodological foundation and empirical evidence for sustainable AI assessment.
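The paper's implementation details are not given on this page, but a per-response energy readout of the kind GEA displays can be approximated on NVIDIA hardware by differencing the GPU's cumulative energy counter around each generation call. The sketch below is illustrative only, not GEA's actual code: `EnergyProbe` and the `generate` callable are hypothetical names, and NVML's energy counter requires a Volta-class or newer GPU.

```python
# Minimal sketch (not GEA's implementation): estimate the energy of one
# model response by differencing NVML's cumulative GPU energy counter.
import pynvml

class EnergyProbe:
    def __init__(self, gpu_index: int = 0):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)

    def measure(self, generate, prompt: str):
        # Counter is in millijoules accumulated since driver load
        # (supported on Volta and newer GPUs).
        start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(self.handle)
        response = generate(prompt)  # any callable that runs the model
        end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(self.handle)
        return response, (end_mj - start_mj) / 1000.0  # joules

# An arena front end could call probe.measure(...) once per model and show
# the returned joules next to each anonymized response.
```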

📝 Abstract
The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs answer multiple-choice questions on different topics. However, this method has limitations, the most concerning being its poor correlation with human judgment. An alternative is to have humans evaluate the LLMs directly. This poses scalability issues, as the large and growing number of models makes it impractical (and costly) to run traditional studies that recruit a set of evaluators and have them rank the models' responses. Another alternative is the use of public arenas, such as the popular LM Arena, in which any user can freely evaluate models on any question and rank the responses of two models; the votes are then aggregated into a model ranking. An increasingly important aspect of LLMs is their energy consumption, so evaluating how energy awareness influences humans' choice of model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the models' energy consumption into the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by more complex, top-performing models do not yield an increase in the perceived quality of the responses that justifies their use.
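The abstract says pairwise votes are aggregated into a model ranking but does not state the aggregation method here; arena-style leaderboards commonly use Elo-style updates over pairwise outcomes. Below is a minimal sketch under that assumption; the model names and the K value are illustrative, not taken from the paper.

```python
# Illustrative Elo-style aggregation of pairwise arena votes (the paper's
# actual ranking method is not specified in this abstract).
from collections import defaultdict

K = 32  # update step size per vote

def update(ratings, winner, loser):
    ra, rb = ratings[winner], ratings[loser]
    # Expected probability that the current winner beats the loser.
    expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + K * (1.0 - expected_win)
    ratings[loser] = rb - K * (1.0 - expected_win)

ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000
votes = [("model-small", "model-large"), ("model-small", "model-large")]
for winner, loser in votes:
    update(ratings, winner, loser)

for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```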
Problem

Research questions and friction points this paper is trying to address.

Human evaluation of LLMs lacks scalability and cost efficiency
Automated benchmarks correlate poorly with human judgment of model quality
How energy awareness affects humans' model selection preferences is unexplored
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates energy consumption data in LLM evaluations
Uses public arena for scalable human evaluations
Promotes energy-efficient model selection by users
👥 Authors
Carlos Arriaga
ETSI de Telecomunicación, Universidad Politécnica de Madrid, Spain
Gonzalo Martínez
Universidad Carlos III de Madrid
Eneko Sendin
ETSI de Telecomunicación, Universidad Politécnica de Madrid, Spain
Javier Conde
ETSI de Telecomunicación, Universidad Politécnica de Madrid, Spain
Pedro Reviriego
ETSI de Telecomunicación, Universidad Politécnica de Madrid, Spain