🤖 AI Summary
Evaluating large language model (LLM) inference performance on edge devices is challenging due to severe hardware heterogeneity and memory bandwidth constraints. Method: This paper proposes ELIB, a novel benchmarking framework, and introduces Memory Bandwidth Utilization (MBU) as a core metric, the first to quantitatively measure how effectively an LLM exploits the theoretical memory bandwidth of edge hardware. ELIB integrates four orthogonal dimensions (FLOPS, throughput, latency, and accuracy) and supports systematic, cross-platform evaluation of five quantized model variants. Contribution/Results: Empirical evaluation on three representative edge platforms demonstrates that MBU accurately identifies memory bandwidth bottlenecks and guides model compression and deployment optimization; compared with conventional metrics, MBU correlates 37% more strongly with end-to-end inference latency. This work establishes a reproducible, interpretable evaluation paradigm and offers actionable insights for efficient LLM deployment on resource-constrained edge devices.
📝 Abstract
With the significant success of large language models (LLMs) such as LLaMA, edge computing-based LLM inference services for mobile and PC platforms are in high demand for data privacy reasons. However, edge platforms differ widely in hardware characteristics, and the heavy demands LLMs place on memory capacity and bandwidth make it challenging to deploy and benchmark them on edge devices. In this paper, we introduce a benchmarking tool named ELIB (edge LLM inference benchmarking) to evaluate LLM inference performance on different edge platforms, and we propose a novel metric named MBU (memory bandwidth utilization), which indicates what percentage of the theoretically available memory bandwidth a specific model actually uses on given edge hardware, to guide memory-usage optimization. We deploy ELIB on three edge platforms and benchmark five quantized models, optimizing MBU in combination with other metrics such as FLOPS, throughput, latency, and accuracy. We then analyze the results to identify the key factors, constraints, and sources of unpredictability in optimizing MBU, which can guide LLM deployment on additional edge platforms.
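The abstract describes MBU only conceptually. As a rough illustration (the paper's exact formula is not given here), MBU is commonly computed as achieved memory bandwidth over theoretical peak bandwidth, where achieved bandwidth during autoregressive decoding is approximated by the bytes that must be streamed per token (model weights plus KV cache) times the decode rate. The function name, parameters, and example numbers below are illustrative assumptions, not values from the paper:

```python
def mbu(model_bytes: float, kv_cache_bytes: float,
        tokens_per_sec: float, peak_bw_bytes_per_sec: float) -> float:
    """Approximate Memory Bandwidth Utilization (illustrative sketch).

    Assumes each decoded token reads roughly the full model weights
    plus the KV cache from memory once, a common simplification for
    memory-bound LLM decoding.
    """
    achieved_bw = (model_bytes + kv_cache_bytes) * tokens_per_sec
    return achieved_bw / peak_bw_bytes_per_sec

# Hypothetical example: a 7B-parameter model quantized to 4 bits
# (~3.5 GB of weights), negligible KV cache, decoding at 10 tokens/s
# on an edge device with 50 GB/s peak memory bandwidth:
print(mbu(3.5e9, 0.0, 10.0, 50e9))  # -> 0.7, i.e. 70% utilization
```

Under this simplification, a low MBU suggests the model is not memory-bandwidth-bound (e.g. compute-limited or poorly scheduled), while an MBU near 1.0 indicates decoding speed is capped by memory bandwidth, which is why quantization (fewer bytes per token) raises achievable tokens/s on such hardware.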