PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two inefficiencies in existing cloud-edge collaborative inference frameworks for speculative decoding: token generation and communication run serially, leaving resources underutilized, and rigid triggering of cloud non-autoregressive verification (NAV) causes either premature verification or costly rollbacks. To overcome these limitations, the authors propose PipeSD, a framework built on a token-batch pipeline schedule, optimized by dynamic programming, that overlaps generation with communication. PipeSD also uses a dual-threshold NAV triggering mechanism, with a lightweight Bayesian optimization autotuner that sets its hyperparameters automatically. Implemented with llama-cpp-python, PyTorch, and FastAPI, the system achieves a 1.16×–2.16× speedup and a 14.3%–25.3% reduction in energy consumption over state-of-the-art baselines on a real-world cloud-edge testbed.
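The summary does not say what the two thresholds measure, so the following is only one plausible reading, not PipeSD's actual rule. This minimal sketch assumes the edge draft model reports a confidence for each drafted token, and that verification is triggered when either a single token's confidence falls below a per-token threshold or the running product of confidences falls below a sequence-level threshold. The function name `tokens_before_verification` and both parameter names are hypothetical.

```python
from typing import Sequence


def tokens_before_verification(
    confidences: Sequence[float],
    token_threshold: float,
    sequence_threshold: float,
) -> int:
    """Return how many draft tokens to send before asking the cloud to verify.

    Hypothetical rule (an assumption, not taken from the paper):
    stop drafting at the first token where either
      * the token's own confidence is below `token_threshold`, or
      * the product of all confidences so far is below `sequence_threshold`.
    If neither condition fires, all drafted tokens are sent.
    """
    joint = 1.0
    for count, p in enumerate(confidences, start=1):
        joint *= p
        if p < token_threshold or joint < sequence_threshold:
            return count
    return len(confidences)


# A single weak token (0.3) fires the per-token threshold at position 3.
cut_on_weak_token = tokens_before_verification([0.9, 0.8, 0.3, 0.9], 0.5, 0.1)

# No single token is weak, but the joint confidence decays below 0.75 at position 3.
cut_on_decay = tokens_before_verification([0.9, 0.9, 0.9, 0.9], 0.5, 0.75)
```

Under this reading, the per-token threshold guards against costly rollbacks from an obviously doubtful token, while the sequence-level threshold avoids verifying too early when every token still looks reliable. The two values are the kind of hyperparameters an autotuner could search over.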

📝 Abstract

Speculative decoding can significantly accelerate LLM inference, and deploying it collaboratively across cloud and edge additionally offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative inference frameworks with speculative decoding are constrained by (i) sequential token generation and communication, which leads to low resource utilization, and (ii) inflexible triggering of cloud non-autoregressive verification (NAV), which induces premature verification or costly rollbacks. In this paper, we propose PipeSD, an efficient cloud-edge collaborative pipeline inference framework with speculative decoding. PipeSD overlaps token generation and communication through a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner. We implement PipeSD using llama-cpp-python, PyTorch, and FastAPI, and evaluate it on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios. Results show that PipeSD consistently outperforms state-of-the-art baselines, achieving a 1.16×–2.16× speedup and reducing energy consumption by 14.3%–25.3%.
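To illustrate how dynamic programming can choose token batches so that transmission overlaps with generation, here is a minimal sketch under a simplified cost model that is an assumption, not the paper's formulation: the edge generates one draft token every `gen_ms` milliseconds, and sending a batch of k tokens costs `send_fixed_ms + send_per_token_ms * k`. A batch can start sending only once its last token is generated and the previous batch has finished sending. The function name `plan_batches` and all parameter names are hypothetical.

```python
from typing import List, Tuple


def plan_batches(
    n_tokens: int,
    gen_ms: int,
    send_fixed_ms: int,
    send_per_token_ms: int,
) -> Tuple[float, List[int]]:
    """Split n_tokens draft tokens into consecutive batches to minimize
    the time at which the last batch finishes transmitting.

    finish[i] = earliest time by which the first i tokens can be fully sent.
    Recurrence: finish[i] = min over j < i of
        max(finish[j], i * gen_ms) + send_fixed_ms + send_per_token_ms * (i - j)
    """
    finish: List[float] = [0] + [float("inf")] * n_tokens
    prev = [0] * (n_tokens + 1)  # prev[i] = start index of the last batch ending at i

    for i in range(1, n_tokens + 1):
        ready = i * gen_ms  # time when token i has been generated
        for j in range(i):
            start = max(finish[j], ready)
            candidate = start + send_fixed_ms + send_per_token_ms * (i - j)
            if candidate < finish[i]:
                finish[i] = candidate
                prev[i] = j

    # Walk back through prev[] to recover the batch sizes in sending order.
    batches: List[int] = []
    i = n_tokens
    while i > 0:
        batches.append(i - prev[i])
        i = prev[i]
    batches.reverse()
    return finish[n_tokens], batches


# Example: 4 tokens, 10 ms per token, 10 ms fixed + 1 ms per token to send.
makespan, batches = plan_batches(4, gen_ms=10, send_fixed_ms=10, send_per_token_ms=1)

# Serial baseline under the same model: generate everything, then send one batch.
serial = 4 * 10 + 10 + 1 * 4
```

In this toy setting the planner sends two batches of two tokens and finishes at 52 ms, against 54 ms for the serial baseline. The gain grows as the fixed per-batch cost shrinks relative to generation time, since more of the transmission can hide behind ongoing generation.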

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

cloud-edge collaboration

pipeline inference

non-autoregressive verification

token generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

cloud-edge collaboration

pipeline scheduling