Benchmarking Energy Efficiency of Large Language Models Using vLLM

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
The high energy consumption of large language models (LLMs) severely hinders their sustainable deployment, yet existing energy-efficiency evaluations rely heavily on idealized benchmarks that poorly reflect real-world production workloads. To address this gap, we propose the first energy-efficiency benchmarking framework tailored to realistic LLM inference loads. Built upon vLLM, it establishes a multi-concurrent, dynamically scheduled testbed that emulates production-grade request patterns. We systematically measure and analyze energy consumption across diverse model scales, architectures, and inference workloads. Through cross-model and cross-configuration empirical studies, we quantitatively uncover previously unreported nonlinear relationships between energy efficiency and key factors—including parameter count, attention mechanism design, and hardware utilization. This work not only demonstrates the feasibility of production-relevant energy-efficiency assessment but also delivers a reproducible, extensible quantitative toolkit and actionable optimization guidelines—laying both methodological foundations and practical evidence for green AI systems.

📝 Abstract
The prevalence of Large Language Models (LLMs) is having a growing impact on the climate due to the substantial energy required for their deployment and use. To raise awareness among developers who are implementing LLMs in their products, there is a strong need to collect more information about the energy efficiency of LLMs. While existing research has evaluated the energy efficiency of various models, these benchmarks often fall short of representing realistic production scenarios. In this paper, we introduce the LLM Efficiency Benchmark, designed to simulate real-world usage conditions. Our benchmark utilizes vLLM, a high-throughput, production-ready LLM serving backend that optimizes model performance and efficiency. We examine how factors such as model size, architecture, and concurrent request volume affect inference energy efficiency. Our findings demonstrate that it is possible to create energy efficiency benchmarks that better reflect practical deployment conditions, providing valuable insights for developers aiming to build more sustainable AI systems.
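The core metric behind such a benchmark is energy per unit of useful work, typically tokens per joule. A minimal sketch of how that metric can be derived from GPU power samples (e.g. polled via NVML during an inference run) is shown below; the sampling interval, power values, and token count are illustrative assumptions, not measurements from the paper.

```python
# Sketch: deriving an energy-efficiency metric (tokens per joule) from
# GPU power samples taken during an inference run. Assumes power was
# sampled at a fixed interval; the sample values below are illustrative.

def energy_joules(power_watts, interval_s):
    """Integrate power samples into energy (joules) via the trapezoidal rule."""
    energy = 0.0
    for p0, p1 in zip(power_watts, power_watts[1:]):
        energy += (p0 + p1) / 2 * interval_s
    return energy

def tokens_per_joule(tokens_generated, power_watts, interval_s):
    """Energy-efficiency metric: output tokens per joule consumed."""
    return tokens_generated / energy_joules(power_watts, interval_s)

# Example: 5 power samples at 0.5 s intervals while generating 128 tokens.
samples = [210.0, 265.0, 280.0, 275.0, 230.0]
print(round(energy_joules(samples, 0.5), 1))          # → 520.0 joules
print(round(tokens_per_joule(128, samples, 0.5), 3))  # → 0.246 tokens/J
```

In practice the power trace would come from a sampler such as `nvidia-smi` or the NVML bindings running alongside the vLLM server, but the integration step is the same.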
Problem

Research questions and friction points this paper is trying to address.

Benchmarking energy efficiency of LLMs in realistic scenarios
Evaluating how model size and architecture affect energy consumption
Assessing impact of concurrent requests on inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes vLLM for high-throughput serving
Simulates real-world concurrent request conditions
Benchmarks energy efficiency across model architectures
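The concurrent-request simulation above can be sketched with a bounded thread pool dispatching prompts against a serving backend. In this minimal sketch the vLLM call is stubbed out with a sleep; `fake_generate` and `run_load` are hypothetical names for illustration, not part of the paper's toolkit.

```python
# Sketch: emulating a multi-concurrent request pattern of the kind the
# benchmark drives against a vLLM server. The inference call is stubbed
# (time.sleep standing in for model latency) so the load-generation
# structure is visible without a GPU or a running server.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_generate(prompt):
    """Stand-in for a request to a vLLM endpoint; returns a token count."""
    time.sleep(0.01)            # pretend inference latency
    return len(prompt.split())  # pretend one output token per input word

def run_load(prompts, concurrency):
    """Dispatch prompts at a bounded concurrency level; report throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(fake_generate, prompts))
    elapsed = time.perf_counter() - start
    return tokens, tokens / elapsed  # total tokens, tokens per second

prompts = ["explain energy efficiency of llms"] * 32
tokens, tps = run_load(prompts, concurrency=8)
print(tokens)  # → 160 (5 words x 32 prompts)
```

Sweeping `concurrency` while logging GPU power alongside throughput is what lets a benchmark like this relate hardware utilization to energy per token.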
Kalle Pronk
Fontys University of Applied Sciences, Eindhoven, The Netherlands
Qin Zhao
Shanghai Artificial Intelligence Laboratory
Generative AI · Embodied AI