ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge-Cloud Speculative LLM Serving

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In distributed edge-cloud collaborative inference, speculative decoding must navigate a vast joint configuration space—spanning draft model variants, quantization levels, speculation lengths, and heterogeneous devices—to balance throughput, cost, and energy efficiency, as no single fixed configuration can simultaneously optimize all three objectives. This work proposes ConfigSpec, a framework that leverages device- and model-aligned performance profiling to model throughput, acceptance rate, and power consumption during the draft phase, enabling dynamic multi-objective configuration selection. Experiments show that maximum throughput is achieved with device-dependent optimal speculation lengths (K* = 2–10) and the smallest draft model, while both cost and energy efficiency converge at K = 2 but favor the largest and smallest draft models, respectively, thereby validating the necessity and effectiveness of dynamic configuration adaptation.

📝 Abstract
Speculative decoding enables collaborative Large Language Model (LLM) inference across cloud and edge by separating lightweight token drafting from heavyweight verification. While prior systems show performance and cost benefits, practical deployment requires navigating a large configuration space spanning draft model variants, quantisation levels, speculation lengths, and heterogeneous edge devices. This paper presents ConfigSpec, a configuration-selection framework for distributed speculative LLM serving. ConfigSpec profiles edge devices and draft-target alignment, and models drafting throughput, acceptance rate, and power to evaluate goodput, verification cost efficiency, and energy efficiency across the joint configuration space. Our analysis across three edge platforms and two LLM families reveals structurally conflicting optima. Firstly, goodput is maximised by the smallest, fastest draft model at device-dependent speculation lengths (K*=2-10). Secondly, both cost and energy efficiency converge to K=2 due to a dominant bonus-token effect, with cost favouring the largest drafter for its high acceptance rate and energy favouring the smallest for its low power draw. These conflicts confirm that no single fixed configuration can simultaneously optimise all objectives, underscoring the need for profiling-based configuration selection in disaggregated edge-cloud LLM inference.
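The abstract's objective conflict can be made concrete with a small sketch. Assuming a simple speculative-decoding model where each drafted token is accepted independently with probability alpha, the expected number of committed tokens per verification round (including the verifier's bonus token when all K drafts pass) is (1 - alpha^(K+1)) / (1 - alpha). The field names, timing model, and scoring functions below are illustrative assumptions, not the paper's actual ConfigSpec implementation:

```python
from dataclasses import dataclass

@dataclass
class Config:
    draft_model: str   # hypothetical draft model identifier
    k: int             # speculation length K
    alpha: float       # profiled per-token acceptance rate (0 < alpha < 1)
    t_draft: float     # profiled edge draft time per token (s)
    t_verify: float    # profiled cloud verification time per round (s)
    p_draft: float     # profiled edge power draw while drafting (W)

def expected_tokens(alpha: float, k: int) -> float:
    # E[tokens per round] = (1 - alpha^(k+1)) / (1 - alpha),
    # the standard speculative-decoding expectation with bonus token.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def goodput(cfg: Config) -> float:
    # Committed tokens per second of wall-clock time for one round.
    round_time = cfg.k * cfg.t_draft + cfg.t_verify
    return expected_tokens(cfg.alpha, cfg.k) / round_time

def energy_efficiency(cfg: Config) -> float:
    # Committed tokens per joule of edge drafting energy
    # (cloud-side verification energy is ignored in this sketch).
    draft_energy = cfg.k * cfg.t_draft * cfg.p_draft
    return expected_tokens(cfg.alpha, cfg.k) / draft_energy

def select(configs, objective):
    # Pick the configuration maximising the chosen objective.
    return max(configs, key=objective)
```

Even in this toy model the optima diverge: a larger K amortises the fixed verification cost (helping goodput on slow links) but spends more edge energy per committed token, so the best K depends on which objective `select` is given.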
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
configuration selection
edge-cloud LLM serving
goodput
energy efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
configuration selection
edge-cloud inference
LLM serving
profiling-based optimization