Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of deploying large language models (LLMs) on edge devices, where a high-dimensional configuration space and the lack of structured, multi-dimensional evaluation make it hard to balance performance, power consumption, and physical footprint. The authors propose a multi-dimensional benchmarking methodology for edge scenarios and evaluate LLM inference across four edge platform configurations built on single-board computers with the latest available NPU/GPU accelerators, measuring token throughput, power efficiency, and device size. Building on advances in model distillation and quantization that make local inference feasible, the study quantifies the benefits of dedicated hardware accelerators and offers practical guidance for deployments in privacy-sensitive and connectivity-limited environments.

📝 Abstract

Large language models (LLMs) are becoming increasingly capable at small parameter scales. At the same time, conventional cloud-centric deployment introduces challenges around data privacy, latency, and cost that are acute in operational technology and defence environments. Advances in model distillation, quantisation, and affordable edge accelerators now make local LLM inference on single-board computers feasible, but the high dimensionality of the configuration space makes identifying optimal deployments difficult without structured evaluation. Existing LLM-specific edge benchmarking efforts are limited by CPU-only inference, poor coverage of genuine single-board computers, and generic evaluation tasks that lack multi-dimensional assessment of hardware effectiveness. This paper proposes a multi-dimensional benchmarking methodology that jointly evaluates inference performance and hardware efficiency across four IoT-suitable edge platform configurations, testing single-board computers with the latest available hardware accelerators. Our results reveal the benefits of using hardware accelerators such as NPUs and GPUs, and the multi-dimensional evaluations quantify the trade-offs between power efficiency, physical device size, and token throughput, offering practical guidance for deploying generative AI in privacy-sensitive and connectivity-limited environments such as unmanned vehicles and portable, ruggedised operations.
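To make the three dimensions named in the abstract concrete (token throughput, power efficiency, and physical size), the following minimal Python sketch shows one way such metrics could be derived from a single benchmark run. It is not the authors' methodology: the class `RunMeasurement`, its field names, and the specific normalisations (tokens per joule, throughput per litre of device volume) are assumptions made for illustration, and the numbers in the usage example are placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunMeasurement:
    """One benchmark run on one device. All field names are hypothetical."""
    device: str
    tokens_generated: int   # tokens produced during the run
    wall_time_s: float      # generation time in seconds
    mean_power_w: float     # average power draw during the run, in watts
    volume_cm3: float       # physical device volume, in cubic centimetres


def throughput_tps(run: RunMeasurement) -> float:
    """Token throughput in tokens per second."""
    return run.tokens_generated / run.wall_time_s


def tokens_per_joule(run: RunMeasurement) -> float:
    """Energy efficiency: energy (J) = mean power (W) * time (s)."""
    energy_j = run.mean_power_w * run.wall_time_s
    return run.tokens_generated / energy_j


def tps_per_litre(run: RunMeasurement) -> float:
    """Throughput normalised by device volume (1 litre = 1000 cm^3)."""
    return throughput_tps(run) / (run.volume_cm3 / 1000.0)


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured data.
    run = RunMeasurement("hypothetical-sbc", 500, 50.0, 10.0, 250.0)
    print(run.device, throughput_tps(run), tokens_per_joule(run), tps_per_litre(run))
```

Reporting the three values side by side, rather than collapsing them into one score, keeps the trade-off visible: a board with higher raw throughput can still rank lower once power draw or size is taken into account.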

Problem

Research questions and friction points this paper is trying to address.

LLM inference

edge computing

single-board computers

hardware accelerators

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

edge AI

LLM inference

hardware acceleration