🤖 AI Summary
In edge-cloud collaborative large language model inference, imbalanced workloads between draft generation and verification waste resources and cause verification interference, limiting the efficiency and scalability of speculative decoding. This work is the first to formally characterize and address these two challenges, wasted drafting time and verification interference, in this setting. We propose a smart speculative controller, a verification time estimator, and an SLO-aware batch scheduler to co-optimize draft generation at the edge and verification in the cloud. Experimental results demonstrate that our approach improves system capacity by up to 2.1× and 4.1×, and effective throughput (goodput) by up to 1.94× and 3.7×, compared to centralized serving and SLED, respectively, achieving highly efficient and accuracy-preserving distributed speculative inference.
📝 Abstract
As Large Language Models (LLMs) become increasingly accessible to end users, an ever-growing number of inference requests are initiated from edge devices and served by centralized GPU clusters. However, the resulting exponential growth in computational workload is placing significant strain on data centers, while edge devices remain largely underutilized, leading to imbalanced workloads and resource inefficiency across the network. Integrating edge devices into the LLM inference process via speculative decoding helps balance the workload between the edge and the cloud while maintaining lossless prediction accuracy. In this paper, we identify and formalize two critical bottlenecks that limit the efficiency and scalability of distributed speculative LLM serving: Wasted Drafting Time and Verification Interference. To address these challenges, we propose WISP, an efficient and SLO-aware distributed LLM inference system that consists of an intelligent speculation controller, a verification time estimator, and a verification batch scheduler. These components collaboratively enhance drafting efficiency and optimize verification request scheduling on the server. Extensive experimental results show that WISP improves system capacity by up to 2.1× and 4.1×, and increases system goodput by up to 1.94× and 3.7×, compared to centralized serving and SLED, respectively.
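To make the drafting/verification split concrete, below is a minimal sketch of one round of standard speculative sampling, the lossless accept/reject rule the abstract builds on: a small draft model (the edge side) proposes `k` tokens, and the target model (the cloud side) verifies them in one pass, accepting each token with probability `min(1, p/q)` and resampling from the residual distribution on the first rejection. The toy `draft_model` and `target_model` distributions here are placeholders, not WISP's actual models, and none of WISP's scheduling components (speculation controller, time estimator, batch scheduler) are modeled.

```python
import random

random.seed(0)
VOCAB = 8  # toy vocabulary size

def draft_model(ctx):
    # Stand-in for the edge-side draft LLM: a fixed categorical distribution.
    return [0.5 if t == (len(ctx) % VOCAB) else 0.5 / (VOCAB - 1)
            for t in range(VOCAB)]

def target_model(ctx):
    # Stand-in for the cloud-side target LLM (what verification evaluates).
    return [0.4 if t == (len(ctx) % VOCAB) else 0.6 / (VOCAB - 1)
            for t in range(VOCAB)]

def sample(probs):
    # Draw one token index from a categorical distribution.
    r, acc = random.random(), 0.0
    for t, p in enumerate(probs):
        acc += p
        if r < acc:
            return t
    return len(probs) - 1

def speculative_step(ctx, k=4):
    """Draft k tokens, then verify them against the target model.

    Returns the committed tokens; the accept/resample rule makes the
    output distribution identical to sampling from the target alone."""
    drafted, q_dists, local = [], [], list(ctx)
    for _ in range(k):                       # edge: autoregressive drafting
        q = draft_model(local)
        t = sample(q)
        drafted.append(t)
        q_dists.append(q)
        local.append(t)

    committed, local = [], list(ctx)
    for t, q in zip(drafted, q_dists):       # cloud: verification pass
        p = target_model(local)
        if random.random() < min(1.0, p[t] / q[t]):
            committed.append(t)              # accepted draft token
            local.append(t)
        else:
            # Rejected: resample from the residual max(p - q, 0), normalized.
            residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            z = sum(residual)
            committed.append(sample([r / z for r in residual] if z > 0 else p))
            break
    else:
        # All k drafts accepted: the target emits one bonus token for free.
        committed.append(sample(target_model(local)))
    return committed

out = speculative_step([0], k=4)
```

Each call commits between 1 and `k + 1` tokens per target-model pass; the gap between tokens drafted and tokens committed is exactly the "wasted drafting time" the paper targets, and co-locating many such verification passes in one cloud batch is where verification interference arises.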