🤖 AI Summary
To address the high infrastructure cost and low resource utilization of centralized, server-based LLM serving, this paper proposes an edge-cooperative speculative decoding framework. The approach is the first to let edge devices proactively generate draft sequences, and it adds a pipeline-aware multi-request scheduling strategy, enabling deep collaboration between edge and server GPUs across the entire speculative decoding process. Crucially, only lightweight token-level outputs, not full logits or hidden states, are transmitted over the network, avoiding substantial communication overhead. Compared to a pure server-based baseline, the framework achieves a 2.22× improvement in server throughput, reduces end-to-end per-token latency by 11.24%, and improves cost efficiency by 1.91×. This work establishes a paradigm in which edge devices actively and substantively contribute to LLM inference acceleration, moving beyond mere offloading to genuine collaborative computation.
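The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration, not SpecEdge's actual implementation: the function names (`edge_draft`, `server_verify`), the draft length `GAMMA`, and the toy stand-in "models" are all assumptions; the one property it does reflect from the summary is that only token IDs cross the edge-server boundary.

```python
GAMMA = 4  # assumed number of draft tokens the edge proposes per round

def edge_draft(prefix, gamma=GAMMA):
    """Edge GPU stand-in: cheaply propose the next `gamma` tokens.
    Toy rule: each next token is (last + 1) % 100."""
    out, last = [], prefix[-1]
    for _ in range(gamma):
        last = (last + 1) % 100
        out.append(last)
    return out

def server_verify(prefix, draft):
    """Server GPU stand-in: check the draft against the 'target model'.
    The toy target agrees with the draft rule except when the next token
    is a multiple of 7, where it emits 0 instead. Returns the accepted
    prefix of the draft plus one corrected token on the first mismatch,
    mirroring standard speculative decoding acceptance."""
    accepted, last = [], prefix[-1]
    for t in draft:
        target = (last + 1) % 100
        if target % 7 == 0:
            target = 0
        if t == target:
            accepted.append(t)
            last = t
        else:
            accepted.append(target)  # server's correction ends the round
            return accepted
    return accepted

def generate(prompt, n_tokens):
    """Driver loop: only token IDs are exchanged between the two sides."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        draft = edge_draft(seq)          # runs on the edge GPU
        seq.extend(server_verify(seq, draft))  # runs on the server GPU
    return seq[len(prompt):][:n_tokens]

# Example: generate(prompt=[1], n_tokens=8) -> [2, 3, 4, 5, 6, 0, 1, 2]
# (the target model overrides the drafted 7 with 0, then drafting resumes)
```

In a real deployment the edge would keep drafting speculatively while the server verifies the previous round ("proactive drafting"), which this sequential sketch omits for clarity.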
📝 Abstract
Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification, and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge improves overall cost efficiency by 1.91x by raising server throughput 2.22x, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.
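The throughput benefit of pipeline-aware scheduling can be seen in a toy simulation. Everything here is a hypothetical sketch, not SpecEdge's scheduler: the per-round costs (`DRAFT_MS`, `VERIFY_MS`) and the earliest-ready-first policy are assumptions chosen only to show why interleaving requests fills the server's idle gaps while edges are drafting.

```python
DRAFT_MS, VERIFY_MS = 30, 10  # assumed per-round edge and server costs

def server_busy_time(n_requests, rounds):
    """Simulate `rounds` draft/verify rounds per request, interleaving
    verification jobs on one server GPU; edges draft in parallel.
    Returns (total server busy time, makespan) in ms."""
    ready = [DRAFT_MS] * n_requests  # when each request's first draft is ready
    done = [0] * n_requests          # verification rounds completed per request
    server_free = busy = 0
    while min(done) < rounds:
        # earliest-ready-first: pick the pending request whose draft
        # arrives soonest, so the server rarely sits idle
        i = min((r for r in range(n_requests) if done[r] < rounds),
                key=lambda r: ready[r])
        start = max(server_free, ready[i])
        server_free = start + VERIFY_MS
        busy += VERIFY_MS
        done[i] += 1
        ready[i] = server_free + DRAFT_MS  # edge drafts the next round
    return busy, server_free

# With one request the server idles through every edge drafting phase
# (utilization 30/120 = 25% here); with three interleaved requests the
# gaps fill with other users' verifications (90/140 ~= 64%).
```

The exact numbers depend entirely on the assumed timings; the point is the qualitative effect the abstract claims, that interleaving multiple users' verification jobs raises server-side utilization and hence throughput.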