🤖 AI Summary
In distributed edge-cloud collaborative inference, speculative decoding must navigate a vast joint configuration space—spanning draft model variants, quantization levels, speculation lengths, and heterogeneous devices—to balance throughput, cost, and energy efficiency; no single fixed configuration can optimize all three objectives at once. This work proposes ConfigSpec, a framework that leverages device- and model-aligned performance profiling to model throughput, acceptance rate, and power consumption during the draft phase, enabling dynamic multi-objective configuration selection. Experiments show that maximum throughput is achieved with the smallest draft model at device-dependent optimal speculation lengths (K* = 2–10), while cost and energy efficiency both converge at K = 2 but favor the largest and smallest draft models, respectively. These conflicting optima validate the necessity and effectiveness of dynamic configuration adaptation.
📝 Abstract
Speculative decoding enables collaborative Large Language Model (LLM) inference across cloud and edge by separating lightweight token drafting from heavyweight verification. While prior systems show performance and cost benefits, practical deployment requires navigating a large configuration space spanning draft model variants, quantisation levels, speculative lengths, and heterogeneous edge devices. This paper presents ConfigSpec, a configuration-selection framework for distributed speculative LLM serving. ConfigSpec profiles edge devices and draft-target alignment, and models drafting throughput, acceptance rate, and power to evaluate goodput, verification cost efficiency, and energy efficiency across the joint configuration space. Our analysis across three edge platforms and two LLM families reveals structurally conflicting optima. Firstly, goodput is maximised by the smallest, fastest draft model at device-dependent speculative lengths (K* = 2–10). Secondly, both cost and energy efficiency converge to K = 2 due to a dominant bonus-token effect, with cost favouring the largest drafter for its high acceptance rate and energy favouring the smallest for its low power draw. These conflicts confirm that no single fixed configuration can simultaneously optimise all objectives, underscoring the need for profiling-based configuration selection in disaggregated edge-cloud LLM inference.
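To make the selection problem concrete, the sketch below enumerates a joint configuration space of (device, draft model, speculative length K) and scores each point on the three objectives the abstract names. All profiled numbers, the verification model, and the helper names are illustrative assumptions, not ConfigSpec's actual implementation or measurements from the paper; the expected-tokens formula is the standard geometric sum for speculative decoding with a per-token acceptance rate plus one bonus token per verification round.

```python
# Hypothetical profiled metrics per (device, draft model) pair.
# Values are illustrative placeholders, not measurements from the paper.
PROFILE = {
    # (device, draft): (draft_tokens_per_s, acceptance_rate, draft_power_w)
    ("edge_a", "tiny"):  (120.0, 0.60, 8.0),
    ("edge_a", "large"): (40.0, 0.80, 15.0),
    ("edge_b", "tiny"):  (30.0, 0.55, 4.0),
}
VERIFY_LATENCY_S = 0.08   # assumed cloud verification time per round
VERIFY_COST = 1.0         # assumed monetary cost per verification call


def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens committed per verification round: geometric sum of
    accepted draft tokens plus the target's bonus token, (1-a^(k+1))/(1-a)."""
    return sum(alpha ** i for i in range(k + 1))


def evaluate(device: str, draft: str, k: int):
    """Score one configuration on goodput, cost efficiency, energy efficiency."""
    rate, alpha, power = PROFILE[(device, draft)]
    draft_time = k / rate                        # time to draft k tokens
    toks = expected_tokens(alpha, k)             # tokens committed per round
    goodput = toks / (draft_time + VERIFY_LATENCY_S)   # tokens per second
    cost_eff = toks / VERIFY_COST                # tokens per verification unit
    energy_eff = toks / (power * draft_time)     # tokens/joule (draft phase only)
    return goodput, cost_eff, energy_eff


# Exhaustive search over the joint space, one winner per objective.
space = [(d, m, k) for (d, m) in PROFILE for k in range(1, 11)]
best = {
    name: max(space, key=lambda c, i=i: evaluate(*c)[i])
    for i, name in enumerate(["goodput", "cost", "energy"])
}
```

Even in this toy space the per-objective winners can differ, mirroring the paper's finding that throughput, cost, and energy favour different drafters and speculative lengths.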