GoodSpeed: Optimizing Fair Goodput with Adaptive Speculative Decoding in Distributed Edge Inference

📅 2025-12-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In multi-user edge scenarios, distributed LLM inference faces a fundamental trade-off between fairness and goodput when heterogeneous draft servers collaborate with a centralized verifier. To address this, we propose an adaptive speculative decoding framework. Methodologically, it (1) introduces a gradient-based proportional-fair scheduling algorithm that optimizes a logarithmic utility function, achieving strict proportional fairness across heterogeneous draft servers for the first time; (2) establishes a fluid-scale model to prove steady-state optimality convergence and bounded approximation error under dynamic workloads; and (3) integrates parallel draft generation using heterogeneous small models, centralized verification, and dynamic task assignment. Experiments demonstrate significant improvements in average goodput, rigorous enforcement of resource fairness, and near-optimal performance under time-varying loads, enabling scalable edge LLM inference systems.
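The gradient-based proportional-fair scheduler described above can be sketched with the classic greedy rule for maximizing a sum of logarithmic utilities: grant each verification slot to the server with the largest ratio of instantaneous achievable goodput to smoothed average goodput. This is a minimal illustrative sketch, not the paper's exact algorithm; the rates, the EWMA smoothing factor `beta`, and the function name `pf_schedule` are all assumptions for illustration.

```python
# Hypothetical sketch of proportional-fair slot assignment. Picking the
# server that maximizes inst_rate / avg_goodput is a greedy gradient-ascent
# step on sum_i log(avg_goodput_i), the log-utility objective.

def pf_schedule(inst_rates, avg_goodput, beta=0.1):
    """Choose the server index maximizing inst_rate / avg_goodput,
    then update each server's exponentially weighted average goodput."""
    chosen = max(range(len(inst_rates)),
                 key=lambda i: inst_rates[i] / max(avg_goodput[i], 1e-9))
    for i in range(len(avg_goodput)):
        served = inst_rates[i] if i == chosen else 0.0
        avg_goodput[i] = (1 - beta) * avg_goodput[i] + beta * served
    return chosen
```

Because a server's average goodput decays while it is not scheduled, its marginal log-utility gain grows until it wins a slot, so no server is starved even when its rate is lower.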

๐Ÿ“ Abstract
Large language models (LLMs) have revolutionized natural language processing, yet their high computational demands pose significant challenges for real-time inference, especially in multi-user, resource-constrained environments. Speculative decoding has emerged as a promising technique to accelerate LLM inference by using lightweight draft models to generate candidate tokens, which are subsequently verified by a larger, more accurate model. However, ensuring both high goodput (the effective rate of accepted tokens) and fairness across multiple draft servers cooperating with a central verification server remains an open challenge. This paper introduces GOODSPEED, a novel distributed inference framework that optimizes goodput through adaptive speculative decoding. GOODSPEED employs a central verification server that coordinates a set of heterogeneous draft servers, each running a small language model to generate speculative tokens. To manage resource allocation effectively, GOODSPEED incorporates a gradient scheduling algorithm that dynamically assigns token verification tasks, maximizing a logarithmic utility function to ensure proportional fairness across servers. By processing speculative outputs from all draft servers in parallel, the framework enables efficient collaboration between the verification server and distributed draft generators, streamlining both latency and throughput. Through rigorous fluid sample path analysis, we show that GOODSPEED converges to the optimal goodput allocation in steady-state conditions and maintains near-optimal performance with provably bounded error under dynamic workloads. These results demonstrate that GOODSPEED provides a scalable, fair, and efficient solution for multi-user distributed LLM inference systems.
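The goodput notion in the abstract, the effective rate of accepted tokens, can be made concrete with a standard back-of-the-envelope model of speculative decoding. Assuming (purely for illustration) an i.i.d. per-token acceptance probability `alpha` and draft length `k`, the verifier accepts a geometric prefix of the draft plus one token of its own, so the expected tokens per round is `(1 - alpha**(k+1)) / (1 - alpha)`; dividing by round latency gives goodput. The function names here are hypothetical.

```python
# Sketch under an i.i.d.-acceptance assumption; real acceptance rates
# depend on the draft/target model pair and the input distribution.

def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected accepted tokens per speculative round: the accepted
    prefix of k draft tokens plus one verifier-generated token."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def goodput(alpha: float, k: int, round_latency_s: float) -> float:
    """Accepted tokens per second for one draft server."""
    return expected_tokens_per_round(alpha, k) / round_latency_s
```

For example, with `alpha = 0.5` and `k = 3` draft tokens, a round yields 1.875 accepted tokens on average, so halving round latency doubles goodput without changing the acceptance behavior.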
Problem

Research questions and friction points this paper is trying to address.

Optimizes goodput and fairness in distributed LLM inference
Manages resource allocation across heterogeneous draft servers
Ensures proportional fairness with dynamic token verification scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive speculative decoding optimizes goodput with draft servers
Gradient scheduling algorithm ensures proportional fairness across servers
Parallel processing of speculative outputs streamlines latency and throughput
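The parallel-verification idea in the bullets above can be sketched as a verifier that batches speculative sequences from several draft servers and, for each, accepts the longest prefix matching the target model's greedy choice. This is an illustrative greedy-verification sketch, not the paper's implementation; `target_argmax` is a hypothetical stand-in for a batched target-model forward pass.

```python
# Hypothetical sketch: central verification of multiple draft servers'
# speculative outputs in one pass.

def verify_batch(draft_tokens, target_argmax):
    """draft_tokens: {server_id: [token, ...]}.
    target_argmax: function (server_id, position) -> token the target
    model would emit at that position (assumed, for illustration).
    Returns the number of accepted draft tokens per server."""
    accepted = {}
    for sid, toks in draft_tokens.items():
        n = 0
        for pos, tok in enumerate(toks):
            if tok == target_argmax(sid, pos):
                n += 1  # draft token matches the target model's choice
            else:
                break   # first mismatch ends the accepted prefix
        accepted[sid] = n
    return accepted
```

Batching all servers' drafts into one verifier pass is what lets a single large model amortize its cost across many users, which is the collaboration pattern the framework exploits.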