SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

📅 2025-06-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of efficient large language model (LLM) inference under stringent memory and power constraints on resource-constrained edge devices, this paper proposes SLED, a collaborative speculative decoding framework. SLED pioneers the adaptation of speculative decoding to heterogeneous edge computing: lightweight client devices locally generate multiple draft tokens, while a shared edge server performs unified verification via dynamic batching, enabling hierarchical client-edge collaboration. It supports multi-client reuse of a single target model, significantly reducing GPU memory overhead on the server and ensuring compatibility with diverse hardware platforms. Evaluated on Jetson Orin Nano, Raspberry Pi 5, and RTX 6000, SLED achieves a 32–57% end-to-end latency reduction and a 2.1–3.8× energy efficiency improvement over baseline methods while preserving original model accuracy; concurrent session capacity increases by 3.4×.
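The draft-then-verify loop the summary describes can be illustrated with a minimal greedy sketch. The toy `draft_model` and `target_model` below are hypothetical stand-ins (not the paper's models): a cheap draft proposes several tokens, the precise target checks them, and tokens are accepted up to the first disagreement, at which point the target's own token is kept.

```python
# Hedged sketch of greedy speculative decoding with toy models;
# the paper's actual draft/target models and sampling rule may differ.

def draft_model(prefix, k):
    # Toy draft: predicts each next token as (last token + 1) mod 10.
    out = []
    last = prefix[-1]
    for _ in range(k):
        last = (last + 1) % 10
        out.append(last)
    return out

def target_model(prefix):
    # Toy target: agrees with the draft except it maps 4 -> 7.
    nxt = (prefix[-1] + 1) % 10
    return 7 if nxt == 4 else nxt

def speculative_step(prefix, k=4):
    """Draft k tokens, verify each against the target model, and return
    the accepted tokens plus the target's correction at the first mismatch."""
    drafts = draft_model(prefix, k)
    accepted = []
    ctx = list(prefix)
    for d in drafts:
        t = target_model(ctx)
        if t == d:
            accepted.append(d)
            ctx.append(d)
        else:
            accepted.append(t)  # target's token replaces the rejected draft
            break
    return accepted
```

Because the target checks all k drafts in what would be a single batched forward pass, each accepted draft token saves one sequential target-model step, which is where the latency reduction comes from.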

๐Ÿ“ Abstract
Despite advancements in device capabilities, efficiently inferencing advanced large language models (LLMs) at the edge remains challenging due to limited device memory and power constraints. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or impose substantial cost burdens. This position paper leverages speculative decoding, previously viewed primarily as an acceleration technique for autoregressive LLM generation, as a promising approach specifically adapted to edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a method that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server efficiently batches and verifies the tokens using a more precise target model. This approach supports device heterogeneity and reduces server-side memory footprint by avoiding the need to deploy multiple target models. Our initial experiments with Jetson Orin Nano, Raspberry Pi 5, and an RTX 6000 edge server indicate substantial benefits: significantly reduced latency, improved energy efficiency, and more concurrent inference sessions, all without sacrificing model accuracy.
Problem

Research questions and friction points this paper is trying to address.

Efficient LLM inference at edge with limited resources
Balancing accuracy and efficiency in edge computing
Reducing latency and energy use in edge LLM serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative decoding for edge LLM serving
Heterogeneous device computation orchestration
Shared server verifies multiple local drafts
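The third point, a shared server verifying many clients' drafts, can be sketched as a pooled verification step. Everything here is a hypothetical illustration (the `VerifyRequest` type and `batch_verify` helper are invented names, not SLED's API): requests from several clients are collected and verified against one shared target model, so the server holds a single copy of the target in GPU memory regardless of how many clients connect.

```python
# Hedged sketch of shared-server batched verification; real systems would
# run the per-request loop as one batched forward pass on the GPU.
from dataclasses import dataclass, field

@dataclass
class VerifyRequest:
    client_id: str
    prefix: list = field(default_factory=list)   # tokens generated so far
    drafts: list = field(default_factory=list)   # client's speculative tokens

def batch_verify(requests, target_model):
    """Verify all clients' drafts against one shared target model.
    Returns {client_id: accepted tokens}, greedy accept-until-mismatch."""
    results = {}
    for req in requests:  # conceptually a single dynamic batch
        ctx = list(req.prefix)
        accepted = []
        for d in req.drafts:
            t = target_model(ctx)
            if t == d:
                accepted.append(d)
                ctx.append(d)
            else:
                accepted.append(t)  # keep the target's correction, stop
                break
        results[req.client_id] = accepted
    return results
```

Pooling requests this way is what lets one target model amortize its memory footprint across many clients, instead of deploying a target model per device.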
Xiangchen Li
Ph.D. student at Virginia Tech
Wireless Communication · Machine Learning · Signal Processing
Dimitrios Spatharakis
Postdoctoral Researcher, NTUA
Cloud Computing · Internet of Things · Cyber-Physical Systems · Control Theory · Edge Computing
Jiakun Fan
Department of Computer Science, Virginia Tech, Blacksburg, USA
Dimitrios Nikolopoulos
Department of Computer Science, Virginia Tech, Blacksburg, USA