DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge

📅 2026-04-22

🤖 AI Summary
This work addresses the low inference throughput of large language models in multi-user edge scenarios by proposing DiP-SD, a framework that combines distributed on-device draft generation with a pipelined draft-and-verify schedule at the edge server. DiP-SD formulates throughput maximization as a joint optimization over the number of batches, the user-to-batch assignment, and the per-user draft length, and solves the resulting mixed-integer program with an alternating optimization algorithm. Under a Qwen3-1.7B/Qwen3-32B device-edge deployment, DiP-SD achieves up to 17.89× the throughput of standard autoregressive decoding and 1.93× that of autoregressive decoding with greedy batching.

📝 Abstract
Speculative decoding has emerged as a promising technique for large language model (LLM) inference, accelerating autoregressive decoding via a draft-then-verify scheme. This paper studies a new multi-user edge inference scenario in which draft tokens are generated locally on devices and then offloaded to a centralized edge server for batch verification. The key challenge is to sustain high throughput under the coupled decisions of (i) batching and pipeline scheduling and (ii) per-user draft token length. We propose DiP-SD, which exploits two complementary parallelism dimensions: device-level distributed drafting and phase-level draft-verify pipelining. We formulate a throughput-maximization objective, defined as the expected number of accepted tokens per unit time, and jointly optimize the number of batches, the user-to-batch assignment, and the integer draft lengths. To solve the resulting fractional mixed-integer program, DiP-SD scans over the number of batches and, for each value, alternates between an association subproblem and a draft-length subproblem. Numerical results under a Qwen3-1.7B/Qwen3-32B device-edge deployment show that DiP-SD achieves up to 17.89× the throughput of autoregressive decoding (AD) and 1.93× that of AD with greedy batching.
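To make the objective "expected number of accepted tokens per unit time" concrete, here is a minimal Python sketch. It is not the paper's model: it assumes the common textbook simplification in which each draft token is accepted independently with probability `alpha`, counts one extra verifier token per round, treats verification time as a constant `t_verify`, and ignores pipelining and multi-batch scheduling. All function and parameter names are hypothetical.

```python
from typing import Sequence


def expected_accepted_tokens(alpha: float, draft_len: int) -> float:
    """Expected tokens gained in one draft-then-verify round.

    ASSUMPTION (not from the paper): i.i.d. per-token acceptance with
    probability alpha, plus one token supplied by the verifier itself.
    Closed form: (1 - alpha^(L+1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if alpha == 1.0:
        # Limit of the closed form: every draft token plus the verifier token.
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def round_throughput(
    alphas: Sequence[float],
    draft_lens: Sequence[int],
    t_draft_per_token: Sequence[float],
    t_verify: float,
) -> float:
    """Tokens per second for one batch in one round, without pipelining.

    ASSUMPTION: devices draft in parallel, so drafting time is set by the
    slowest device; the batch is then verified once in t_verify seconds.
    """
    tokens = sum(expected_accepted_tokens(a, l) for a, l in zip(alphas, draft_lens))
    draft_time = max(l * t for l, t in zip(draft_lens, t_draft_per_token))
    return tokens / (draft_time + t_verify)


def best_common_draft_length(
    alphas: Sequence[float],
    t_draft_per_token: Sequence[float],
    t_verify: float,
    max_len: int = 8,
) -> int:
    """Brute-force search over one shared integer draft length (illustration only)."""
    n = len(alphas)
    return max(
        range(1, max_len + 1),
        key=lambda l: round_throughput(alphas, [l] * n, t_draft_per_token, t_verify),
    )
```

For example, `best_common_draft_length([0.8, 0.6], [0.01, 0.02], 0.05)` searches draft lengths 1 through 8 for two users with different acceptance rates and device speeds. The sketch shows why draft length is a real trade-off: longer drafts add expected tokens with diminishing returns while drafting time grows linearly. DiP-SD, by contrast, optimizes per-user lengths jointly with batching and pipelining, which this brute-force search does not attempt.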
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
edge inference
throughput optimization
batch scheduling
draft token length
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
edge inference
distributed drafting
pipeline scheduling
throughput optimization