SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

📅 2025-05-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the high infrastructure cost and low resource utilization of centralized server-based LLM serving, this paper proposes SpecEdge, an edge-cooperative speculative decoding framework. It is the first to have edge devices proactively generate draft sequences, and it introduces a pipeline-aware multi-request scheduling strategy, enabling deep collaboration between edge and server GPUs across the entire speculative decoding process. Crucially, only lightweight token-level outputs, not full logits or hidden states, are transmitted over the network, eliminating substantial communication overhead. Compared to a server-only baseline, SpecEdge achieves 2.22x higher server throughput, reduces inter-token latency by 11.24%, and improves cost efficiency by 1.91x. This establishes a paradigm in which edge devices contribute substantively to LLM inference acceleration, moving beyond mere offloading to genuine collaborative computation.
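The draft-then-verify loop described above can be sketched as follows. This is a toy simulation: the integer "models" are hypothetical stand-ins for the paper's actual draft LLM (on the edge) and target LLM (on the server). The point it illustrates is that only token IDs, never logits or hidden states, cross the edge-server boundary.

```python
# Toy sketch of edge-assisted speculative decoding (assumed mechanics,
# not the paper's implementation). Integer "models" stand in for LLMs.

def edge_draft(seq, k):
    """Edge GPU: cheaply draft k candidate tokens (toy policy: previous + 1)."""
    drafted, cur = [], seq[-1]
    for _ in range(k):
        cur = (cur + 1) % 100
        drafted.append(cur)
    return drafted  # only token IDs are sent to the server

def server_verify(seq, draft):
    """Server GPU: check drafted tokens against the large model in one pass.
    Toy target policy: previous + 2 when previous is a multiple of 5,
    else previous + 1. The first mismatch is corrected and drafting resumes."""
    accepted, cur = [], seq[-1]
    for t in draft:
        target = (cur + 2 if cur % 5 == 0 else cur + 1) % 100
        if t == target:
            accepted.append(t)
            cur = t
        else:
            accepted.append(target)  # replace the first mismatch, stop early
            break
    return accepted  # again, only token IDs travel back to the edge

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        seq += server_verify(seq, edge_draft(seq, k))
    return seq[len(prompt):][:n_tokens]

print(speculative_decode([0], 5))  # -> [2, 3, 4, 5, 7]
```

When the draft agrees with the target, several tokens are accepted per server pass; when it diverges, the server's correction costs nothing extra, which is what makes the speculation profitable.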

📝 Abstract
Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification, and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x by achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.
Problem

Research questions and friction points this paper is trying to address.

Serving large language models at scale remains costly and resource-intensive
Current systems overlook consumer-grade GPUs available at the edge
No existing framework splits LLM workloads between edge and server GPUs to exploit this idle capacity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Edge-assisted speculative decoding for LLM inference
Proactive edge drafting overlaps token creation and verification
Pipeline-aware scheduling interleaves requests to boost throughput
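The benefit of the scheduling idea above can be illustrated with a toy timeline simulation (assumed mechanics with made-up phase durations, not the paper's actual scheduler): each request alternates an edge draft phase and a server verify phase, and interleaving requests lets the server verify one request while edges draft for the others.

```python
# Toy timeline model of pipeline-aware scheduling. DRAFT/VERIFY durations
# are invented for illustration; each request has its own edge GPU, and
# the single server serializes verification passes.

DRAFT, VERIFY = 2.0, 1.0  # hypothetical phase durations (time units)

def simulate(n_requests, rounds, interleave):
    """Return (makespan, server utilization) for a batch of requests."""
    busy = rounds * n_requests * VERIFY  # total server work is fixed
    if interleave:
        # Pipeline-aware: verifies of other requests hide inside draft phases.
        edge_ready = [0.0] * n_requests  # when each edge can start drafting
        server_free, finish = 0.0, 0.0
        for _ in range(rounds):
            for r in range(n_requests):
                draft_done = edge_ready[r] + DRAFT
                start = max(draft_done, server_free)  # wait for server slot
                server_free = start + VERIFY
                edge_ready[r] = server_free  # next draft needs verify result
                finish = max(finish, server_free)
        return finish, busy / finish
    # Serial baseline: the server idles through every draft phase.
    finish = n_requests * rounds * (DRAFT + VERIFY)
    return finish, busy / finish

fast, util_fast = simulate(2, 3, interleave=True)
slow, util_slow = simulate(2, 3, interleave=False)
print(fast, slow)            # interleaved batch finishes sooner
print(util_fast, util_slow)  # and keeps the server busier
```

In this toy setting, two interleaved requests finish in 10 time units versus 18 serially, and server utilization rises from about 33% to 60%; the same idle-time-filling intuition underlies the reported 2.22x server throughput gain.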