SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission

📅 2026-04-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the communication bottleneck in federated large language model (LLM) inference. Autoregressive generation requires frequent full-model forward passes, and each distributed worker must transmit a complete token probability distribution, so communication comes to dominate end-to-end latency. To mitigate this, the study introduces speculative decoding into federated LLM inference for the first time and proposes a top-K sparse compression scheme coupled with server-side reconstruction of the probability distribution. The scheme is backed by derived bounds on local reconstruction error, aggregation bias, and acceptance-rate bias. The method reduces communication overhead and achieves higher inference throughput and lower end-to-end latency while preserving generation quality.
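The top-K compression and server-side reconstruction described above can be illustrated with a minimal sketch. The summary does not specify how the reconstruction works, so the two functions below (`reconstruct_renormalize` and `reconstruct_uniform_tail`) are illustrative assumptions rather than the paper's strategies, and all names are hypothetical.

```python
from typing import Dict, List, Tuple


def compress_top_k(probs: List[float], k: int) -> Tuple[Dict[int, float], int]:
    """Worker side: keep only the k largest probabilities.

    Returns a sparse {token_id: prob} map plus the vocabulary size,
    which the server needs to rebuild a dense vector.
    """
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {i: probs[i] for i in top_ids}, len(probs)


def reconstruct_renormalize(sparse: Dict[int, float], vocab_size: int) -> List[float]:
    """Assumed strategy A: zero the dropped entries, rescale kept ones to sum to 1."""
    total = sum(sparse.values())
    dense = [0.0] * vocab_size
    for token_id, p in sparse.items():
        dense[token_id] = p / total
    return dense


def reconstruct_uniform_tail(sparse: Dict[int, float], vocab_size: int) -> List[float]:
    """Assumed strategy B: keep top-K values as sent and spread the
    missing probability mass evenly over the dropped tokens."""
    residual = max(0.0, 1.0 - sum(sparse.values()))
    n_dropped = vocab_size - len(sparse)
    fill = residual / n_dropped if n_dropped else 0.0
    dense = [fill] * vocab_size
    for token_id, p in sparse.items():
        dense[token_id] = p
    return dense


if __name__ == "__main__":
    full = [0.5, 0.2, 0.15, 0.1, 0.05]
    sparse, vocab_size = compress_top_k(full, k=2)
    print(sparse)
    print(reconstruct_renormalize(sparse, vocab_size))
    print(reconstruct_uniform_tail(sparse, vocab_size))
```

With K much smaller than the vocabulary size, each worker sends K (index, value) pairs per draft token instead of a full distribution, which is where the communication saving comes from. The gap between the reconstructed and original vectors corresponds to the local reconstruction error that the paper bounds.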
📝 Abstract
Federated inference enhances LLM performance in edge computing through weighted averaging of distributed model predictions. However, autoregressive LLM inference requires frequent full-model forward passes across workers, severely limiting decoding throughput. Distributed deployment further aggravates this due to a communication bottleneck: each worker must transmit full token probability distributions per draft token, dominating end-to-end latency. To address these challenges, we introduce speculative decoding to enable parallel LLM processing and propose a top-K compressed transmission scheme with two server-side reconstruction strategies. We theoretically analyze the robustness of our method in terms of local reconstruction error, aggregation bias, and acceptance-rate bias, and derive corresponding bounds. Experiments demonstrate that our scheme achieves high generation fidelity while significantly reducing communication overhead.
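To make the interaction between weighted averaging and speculative decoding concrete, the following is a minimal sketch. It assumes the server aggregates worker distributions by a normalized weighted average and then applies the standard speculative-sampling rule: accept a draft token x with probability min(1, p(x)/q(x)), and otherwise resample from the normalized positive part of p - q. How the paper combines these steps is not stated in the abstract, so the composition and all function names here are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def weighted_average(dists: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Aggregate per-worker token distributions into one target distribution."""
    total = sum(weights)
    vocab_size = len(dists[0])
    return [
        sum(w * d[i] for w, d in zip(weights, dists)) / total
        for i in range(vocab_size)
    ]


def acceptance_probability(target: Sequence[float], draft: Sequence[float], token: int) -> float:
    """Standard speculative-sampling acceptance rule: min(1, p(x) / q(x))."""
    if draft[token] <= 0.0:
        raise ValueError("draft token must have non-zero draft probability")
    return min(1.0, target[token] / draft[token])


def residual_distribution(target: Sequence[float], draft: Sequence[float]) -> List[float]:
    """Distribution used after a rejection: normalized max(0, p - q)."""
    diff = [max(0.0, p - q) for p, q in zip(target, draft)]
    norm = sum(diff)
    # If p == q everywhere, rejection cannot happen; fall back to p.
    return [d / norm for d in diff] if norm > 0.0 else list(target)


def verify_token(target: Sequence[float], draft: Sequence[float], token: int,
                 rng: random.Random) -> Tuple[int, bool]:
    """Return (token, accepted). On rejection the token is resampled."""
    if rng.random() < acceptance_probability(target, draft, token):
        return token, True
    resampled = rng.choices(range(len(target)),
                            weights=residual_distribution(target, draft))[0]
    return resampled, False


if __name__ == "__main__":
    workers = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]  # toy 3-token vocabulary
    target = weighted_average(workers, weights=[1.0, 3.0])
    draft = [0.5, 0.3, 0.2]
    print(target)
    print(acceptance_probability(target, draft, token=0))
    print(verify_token(target, draft, token=1, rng=random.Random(0)))
```

If the server works from compressed worker distributions, the aggregated target differs from the uncompressed one, which shifts the acceptance probability. That shift is the kind of aggregation bias and acceptance-rate bias the abstract says the authors bound.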
Problem

Research questions and friction points this paper is trying to address.

Federated Inference
Large Language Models
Communication Bottleneck
Autoregressive Decoding
Token Probability Transmission
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
federated inference
compressed transmission
top-K sparsification
LLM acceleration
Ce Zheng
Department of Broadband Communication, Pengcheng Laboratory, Shenzhen 518055, China
Xinghan Wang
Department of Broadband Communication, Pengcheng Laboratory, Shenzhen 518055, China
Jiahong Ning
Dalian Maritime University, Dalian 116026, China
Yuxuan Shi
Department of Broadband Communication, Pengcheng Laboratory, Shenzhen 518055, China
Ning Huang
Department of Broadband Communication, Pengcheng Laboratory, Shenzhen 518055, China
Tingting Yang
Professor, Peng Cheng Laboratory
Integrated Maritime Networks, NET4AI, Communications and Computing Integrated Networks