🤖 AI Summary
Evaluating large language model (LLM) inference performance on edge devices is challenging due to severe hardware heterogeneity and memory bandwidth constraints. Method: This paper proposes ELIB, a novel benchmarking framework, and introduces Memory Bandwidth Utilization (MBU) as a core metric, the first to quantitatively measure how effectively an LLM exploits the theoretical memory bandwidth of edge hardware. ELIB integrates four orthogonal dimensions (FLOPS, throughput, latency, and accuracy) and supports systematic, cross-platform evaluation of five quantized model variants. Contribution/Results: Empirical evaluation on three representative edge platforms demonstrates that MBU accurately identifies memory bandwidth bottlenecks and guides model compression and deployment optimization; compared with conventional metrics, MBU correlates 37% more strongly with end-to-end inference latency. This work establishes a reproducible, interpretable evaluation paradigm and offers actionable insights for efficient LLM deployment on resource-constrained edge devices.
📝 Abstract
With the significant success of large language models (LLMs) such as LLaMA, edge computing-based LLM inference services for mobile and PC platforms are in high demand for data privacy reasons. However, edge platforms differ widely in hardware characteristics, and the heavy demands LLMs place on memory capacity and bandwidth make it challenging to deploy and benchmark them on edge devices. In this paper, we introduce a benchmarking tool named ELIB (edge LLM inference benchmarking) to evaluate LLM inference performance on different edge platforms, and we propose a novel metric named MBU (memory bandwidth utilization), which indicates what percentage of the theoretically available memory bandwidth a specific model actually uses on given edge hardware, to guide memory-usage optimization. We deploy ELIB on three edge platforms and benchmark five quantized models, optimizing MBU in combination with other metrics such as FLOPS, throughput, latency, and accuracy. We then analyze the results to identify the key factors, constraints, and sources of unpredictability in optimizing MBU, which can guide LLM deployment on additional edge platforms.
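The abstract describes MBU only conceptually. As a rough illustration (the paper's exact formula is not given here), MBU is commonly computed as achieved memory bandwidth over theoretical peak bandwidth, where achieved bandwidth during autoregressive decoding is approximated by the bytes that must be streamed per token (model weights plus KV cache) times the decode rate. The function name, parameters, and example numbers below are illustrative assumptions, not values from the paper:

```python
def mbu(model_bytes: float, kv_cache_bytes: float,
        tokens_per_sec: float, peak_bw_bytes_per_sec: float) -> float:
    """Approximate Memory Bandwidth Utilization (illustrative sketch).

    Assumes each decoded token reads roughly the full model weights
    plus the KV cache from memory once, a common simplification for
    memory-bound LLM decoding.
    """
    achieved_bw = (model_bytes + kv_cache_bytes) * tokens_per_sec
    return achieved_bw / peak_bw_bytes_per_sec

# Hypothetical example: a 7B-parameter model quantized to 4 bits
# (~3.5 GB of weights), negligible KV cache, decoding at 10 tokens/s
# on an edge device with 50 GB/s peak memory bandwidth:
print(mbu(3.5e9, 0.0, 10.0, 50e9))  # -> 0.7, i.e. 70% utilization
```

Under this simplification, a low MBU suggests the model is not memory-bandwidth-bound (e.g. compute-limited or poorly scheduled), while an MBU near 1.0 indicates decoding speed is capped by memory bandwidth, which is why quantization (fewer bytes per token) raises achievable tokens/s on such hardware.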