🤖 AI Summary
Addressing the challenge of evaluating latency and energy efficiency for large language models (LLMs) across heterogeneous platforms, this paper introduces ELANA—a lightweight, open-source analytical tool. ELANA proposes the first unified energy-efficiency–latency co-analysis framework, supporting multi-GPU and edge GPU deployments. It quantifies key metrics—including model size, KV cache footprint, time-to-first-token (TTFT), time-per-output-token (TPOT), time-to-last-token (TTLT), and real-time power consumption—in a consistent, platform-agnostic manner. Built on PyTorch and Hugging Face Transformers, it integrates CUDA event-based timing and NVML-based power monitoring, ensuring compatibility with all Hugging Face models, low-bit/quantized variants, and optional energy logging. Evaluated on mainstream open-weight LLMs, ELANA achieves millisecond-level latency decomposition and watt-level power measurement, substantially lowering the barrier to rigorous energy-aware inference analysis. The implementation is publicly released and has gained broad adoption in the research community.
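The summary mentions NVML-based power monitoring with optional energy logging. As a minimal illustrative sketch (not ELANA's actual implementation), sampled power readings can be integrated over time to estimate energy consumption; the sample format and helper name below are assumptions for illustration:

```python
# Hedged sketch: integrating NVML-style power samples into energy (joules).
# Assumes (timestamp_seconds, power_watts) pairs, e.g. converted from
# pynvml.nvmlDeviceGetPowerUsage, which reports milliwatts on real hardware.

def energy_joules(samples):
    """Trapezoidal integration of (time, power) samples into energy in J."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)  # area under the power curve
    return total

# Example: a constant 50 W draw sampled over 2 seconds yields 100 J.
print(energy_joules([(0.0, 50.0), (1.0, 50.0), (2.0, 50.0)]))  # 100.0
```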
📝 Abstract
The latency and power consumption of large language models (LLMs) are major constraints when serving them across a wide spectrum of hardware platforms, from mobile edge devices to cloud GPU clusters. Benchmarking is crucial for optimizing efficiency in both model deployment and next-generation model development. To address this need, we open-source a simple profiling tool, ELANA, for evaluating LLMs. ELANA is designed as a lightweight, academic-friendly profiler for analyzing model size, key-value (KV) cache size, prefilling latency (Time-to-first-token, TTFT), generation latency (Time-per-output-token, TPOT), and end-to-end latency (Time-to-last-token, TTLT) of LLMs on both multi-GPU and edge GPU platforms. It supports all publicly available models on Hugging Face and offers a simple command-line interface, along with optional energy consumption logging. Moreover, ELANA is fully compatible with popular Hugging Face APIs and can be easily customized or adapted to compressed or low bit-width models, making it ideal for research on efficient LLMs or for small-scale proof-of-concept studies. We release the ELANA profiling tool at: https://github.com/enyac-group/Elana.
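The three latency metrics in the abstract are related by a simple decomposition: TTFT covers prefill, TTLT covers the whole request, and TPOT averages the decode steps in between. A minimal sketch of that arithmetic (the function name and timestamp format are assumptions, not ELANA's API):

```python
# Hedged sketch: relating TTFT, TPOT, and TTLT given a request start time
# and the timestamps at which each output token finished generating.

def latency_metrics(start, token_times):
    """Return (TTFT, TPOT, TTLT) in seconds from per-token timestamps."""
    ttft = token_times[0] - start    # prefill latency: time to first token
    ttlt = token_times[-1] - start   # end-to-end latency: time to last token
    # TPOT averages the decode-phase steps, i.e. tokens after the first.
    tpot = (ttlt - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    return ttft, tpot, ttlt

# Example: request at t=0, first token at 0.5 s, then one token every 0.1 s.
ttft, tpot, ttlt = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
# ttft = 0.5 s, tpot ≈ 0.1 s, ttlt = 0.8 s
```

In practice such timestamps would come from GPU-side timing (e.g. CUDA events, as the summary notes) rather than wall-clock calls, to avoid counting host-side overhead.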