🤖 AI Summary
Addressing the challenge of evaluating latency and energy efficiency for large language models (LLMs) across heterogeneous platforms, this paper introduces ELANA—a lightweight, open-source analytical tool. ELANA proposes the first unified energy-efficiency–latency co-analysis framework, supporting multi-GPU and edge GPU deployments. It quantifies key metrics—including model size, KV cache footprint, time-to-first-token (TTFT), time-per-output-token (TPOT), time-to-last-token (TTLT), and real-time power consumption—in a consistent, platform-agnostic manner. Built on PyTorch and Hugging Face Transformers, it integrates CUDA event-based timing and NVML-based power monitoring, ensuring compatibility with all Hugging Face models, low-bit/quantized variants, and optional energy logging. Evaluated on mainstream open-weight LLMs, ELANA achieves millisecond-level latency decomposition and watt-level power measurement, substantially lowering the barrier to rigorous energy-aware inference analysis. The implementation is publicly released and has gained broad adoption in the research community.
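The summary mentions NVML-based power monitoring with optional energy logging. As a minimal illustrative sketch (not ELANA's actual implementation), sampled power readings can be integrated over time to estimate energy consumption; the sample format and helper name below are assumptions for illustration:

```python
# Hedged sketch: integrating NVML-style power samples into energy (joules).
# Assumes (timestamp_seconds, power_watts) pairs, e.g. converted from
# pynvml.nvmlDeviceGetPowerUsage, which reports milliwatts on real hardware.

def energy_joules(samples):
    """Trapezoidal integration of (time, power) samples into energy in J."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)  # area under the power curve
    return total

# Example: a constant 50 W draw sampled over 2 seconds yields 100 J.
print(energy_joules([(0.0, 50.0), (1.0, 50.0), (2.0, 50.0)]))  # 100.0
```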
📝 Abstract
The latency and power consumption of large language models (LLMs) are major constraints when serving them across a wide spectrum of hardware platforms, from mobile edge devices to cloud GPU clusters. Benchmarking is crucial for optimizing efficiency in both model deployment and next-generation model development. To address this need, we open-source a simple profiling tool, ELANA, for evaluating LLMs. ELANA is designed as a lightweight, academic-friendly profiler for analyzing model size, key-value (KV) cache size, prefilling latency (Time-to-first-token, TTFT), generation latency (Time-per-output-token, TPOT), and end-to-end latency (Time-to-last-token, TTLT) of LLMs on both multi-GPU and edge GPU platforms. It supports all publicly available models on Hugging Face and offers a simple command-line interface, along with optional energy consumption logging. Moreover, ELANA is fully compatible with popular Hugging Face APIs and can be easily customized or adapted to compressed or low bit-width models, making it ideal for research on efficient LLMs or for small-scale proof-of-concept studies. We release the ELANA profiling tool at: https://github.com/enyac-group/Elana.
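The three latency metrics in the abstract are related by a simple decomposition: TTFT covers prefill, TTLT covers the whole request, and TPOT averages the decode steps in between. A minimal sketch of that arithmetic (the function name and timestamp format are assumptions, not ELANA's API):

```python
# Hedged sketch: relating TTFT, TPOT, and TTLT given a request start time
# and the timestamps at which each output token finished generating.

def latency_metrics(start, token_times):
    """Return (TTFT, TPOT, TTLT) in seconds from per-token timestamps."""
    ttft = token_times[0] - start    # prefill latency: time to first token
    ttlt = token_times[-1] - start   # end-to-end latency: time to last token
    # TPOT averages the decode-phase steps, i.e. tokens after the first.
    tpot = (ttlt - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    return ttft, tpot, ttlt

# Example: request at t=0, first token at 0.5 s, then one token every 0.1 s.
ttft, tpot, ttlt = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
# ttft = 0.5 s, tpot ≈ 0.1 s, ttlt = 0.8 s
```

In practice such timestamps would come from GPU-side timing (e.g. CUDA events, as the summary notes) rather than wall-clock calls, to avoid counting host-side overhead.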