Inference performance evaluation for LLMs on edge devices with a novel benchmarking framework and metric

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language model (LLM) inference performance on edge devices is challenging due to severe hardware heterogeneity and memory bandwidth constraints. Method: This paper proposes ELIB, a novel benchmarking framework, and introduces Memory Bandwidth Utilization (MBU) as a core metric—the first to quantitatively measure how effectively an LLM exploits theoretical memory bandwidth on edge hardware. ELIB integrates four orthogonal dimensions—FLOPS, throughput, latency, and accuracy—and supports systematic, cross-platform evaluation across five quantized model variants. Contribution/Results: Empirical evaluation across three representative edge platforms demonstrates that MBU accurately identifies memory bandwidth bottlenecks and guides model compression and deployment optimization. Compared to conventional metrics, MBU achieves a 37% higher correlation with end-to-end inference latency. This work establishes a reproducible, interpretable evaluation paradigm and provides actionable insights for efficient LLM deployment on resource-constrained edge devices.
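The MBU metric described above can be sketched as the ratio of achieved to theoretical peak memory bandwidth. For memory-bound LLM decoding, each generated token must stream the model weights (plus KV cache) from memory, so achieved bandwidth is roughly bytes moved per token times tokens per second. The function and the numbers below are illustrative assumptions for intuition, not the paper's exact formulation:

```python
def mbu(model_bytes: float, kv_cache_bytes: float,
        tokens_per_second: float, peak_bandwidth_bytes: float) -> float:
    """Fraction of theoretical peak memory bandwidth actually used.

    achieved bandwidth ~= (weights + KV cache read per token) * tokens/s
    """
    achieved = (model_bytes + kv_cache_bytes) * tokens_per_second
    return achieved / peak_bandwidth_bytes

# Hypothetical example: a 7B model quantized to 4 bits (~3.5 GB),
# negligible KV cache, decoding at 10 tok/s on edge hardware with
# 50 GB/s theoretical peak bandwidth.
print(f"MBU = {mbu(3.5e9, 0.0, 10.0, 50e9):.0%}")  # → MBU = 70%
```

A low MBU on this reading means the deployment is leaving available bandwidth unused (e.g., compute stalls or poor memory layout), which is exactly the bottleneck signal the framework is said to surface.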

📝 Abstract
With the significant success achieved by large language models (LLMs) like LLaMA, edge computing-based LLM inference services for mobile and PC are in high demand for data privacy. However, edge platforms differ widely in hardware characteristics, and the large memory capacity and bandwidth requirements of LLMs make it very challenging to deploy and benchmark them on edge devices. In this paper, we introduce a benchmarking tool named ELIB (edge LLM inference benchmarking) to evaluate LLM inference performance across different edge platforms, and propose a novel metric named MBU that indicates what percentage of the theoretically available memory bandwidth a specific model actually uses on edge hardware, guiding memory-usage optimization. We deploy ELIB on three edge platforms and benchmark five quantized models, optimizing MBU in combination with other metrics such as FLOPS, throughput, latency, and accuracy. We then analyze the results to identify the key factors, constraints, and sources of unpredictability in optimizing MBU, which can guide LLM deployment on more edge platforms.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLM inference performance on diverse edge devices
Optimize memory bandwidth usage with novel MBU metric
Guide LLM deployment on edge platforms via benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmarking framework for edge LLM inference
MBU metric optimizes memory bandwidth usage
Quantized models evaluated with multiple performance metrics
Hao Chen
Xidian University, No.2 South Taibai Road, Xi’an, 710071, Shaanxi, China
Cong Tian
Xidian University
Formal methods, Program verification, Software engineering
Zixuan He
Xidian University, No.2 South Taibai Road, Xi’an, 710071, Shaanxi, China
Bin Yu
Xidian University, No.2 South Taibai Road, Xi’an, 710071, Shaanxi, China
Yepang Liu
Associate Professor, CSE, Southern University of Science and Technology
Software testing and analysis, Empirical software engineering, Software security, Cyber-physical
Jialun Cao
The Hong Kong University of Science and Technology
SE for AI, AI for SE