P/D-Device: Disaggregated Large Language Model between Cloud and Devices

📅 2025-08-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address high cloud-side decoding resource consumption and the rapidly increasing device-side prefill time-to-first-token (TTFT) with growing prompt lengths in cloud-edge collaborative LLM deployment, this paper proposes a decoupled cloud-edge inference framework. It offloads part of the prefill computation to the cloud while enabling the device to respond immediately upon receiving the first token, thereby decoupling TTFT from decoding in both time and space. Key innovations include dynamic prompt optimization, token flow rate control, amortized local prefill, and trajectory-driven real-time parameter decision-making. Experiments demonstrate a ≥60% reduction in TTFT, stable per-token generation latency of tens of milliseconds, and up to 15× improvement in cloud throughput. The framework achieves low-latency responsiveness while significantly enhancing overall system efficiency and resource utilization.

📝 Abstract
Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, the many tokens generated during the decoding phase occupy cloud resources for a long time, essentially preventing the cloud from achieving higher throughput. Meanwhile, due to limited on-device resources, the time to first token (TTFT), i.e., the latency of the prefill phase, increases dramatically with the growth in prompt length. To overcome this resource bottleneck, i.e., long occupation in the cloud and limited on-device computing capacity, we propose to disaggregate the large language model between the cloud and devices. That is, the cloud processes a portion of the content for each device, only in its prefill phase. Specifically, after receiving the first token from the cloud, decoupled from its own prefill, the device responds to the user immediately, yielding a lower TTFT. The following tokens from the cloud are then presented via a speed controller for a smoothed TPOT (the time per output token), until the device catches up with the progress. On-device prefill is then amortized using the received tokens, while cloud resource usage is kept under control. Moreover, during cloud prefill, the prompt can be refined using the intermediate data already generated, to further speed up on-device inference. We implement this scheme, P/D-Device, and confirm its superiority over other alternatives. We further propose an algorithm to decide the best settings. Real-trace experiments show that TTFT decreases by at least 60%, maximum TPOT is about tens of milliseconds, and cloud throughput increases by up to 15x.
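The serving flow described in the abstract, respond on the cloud's first token, then pace the remaining cloud tokens at a smoothed TPOT until on-device decoding catches up, can be sketched as follows. This is an illustrative sketch only; the function name, the fixed catch-up index, and the 40 ms pacing interval are assumptions, not details from the paper.

```python
import time

def paced_cloud_stream(cloud_tokens, device_catches_up_at, target_tpot=0.04):
    """Illustrative pacing loop for the cloud-assisted phase.

    cloud_tokens         -- iterable of tokens streamed from the cloud prefill
    device_catches_up_at -- token index at which local decode takes over
                            (assumed signal; the real system would track
                            the device's prefill progress)
    target_tpot          -- smoothed time per output token, tens of ms
    """
    emitted = []
    for i, tok in enumerate(cloud_tokens):
        if i == 0:
            # First token is shown immediately: TTFT is decoupled
            # from the device's own prefill.
            emitted.append(tok)
            continue
        if i >= device_catches_up_at:
            # Device prefill has caught up; hand off to on-device decoding.
            break
        time.sleep(target_tpot)  # speed controller smooths per-token latency
        emitted.append(tok)
    return emitted
```

In the real system the hand-off point would be driven by the device's actual prefill progress rather than a fixed token index.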
Problem

Research questions and friction points this paper is trying to address.

Reduce cloud resource occupation during LLM decoding phase
Minimize on-device prefill latency for faster response
Balance cloud-device workload for optimized throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cloud-device disaggregated LLM for resource optimization
Speed controller for smoothed token output
Prompt refinement to accelerate on-device inference
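The paper's settings-decision algorithm is not detailed on this page, but its core constraint is easy to picture: the cloud should stream just enough paced tokens for the display time to hide the device's remaining prefill. A back-of-the-envelope sketch, where the rate and latency symbols are assumptions rather than the paper's notation:

```python
import math

def min_cloud_tokens(prompt_len, device_prefill_rate, tpot):
    """Minimum number of cloud tokens to stream so that the paced display
    time (tokens * tpot) covers the device's local prefill time.

    prompt_len          -- prompt length in tokens
    device_prefill_rate -- assumed on-device prefill throughput, tokens/s
    tpot                -- smoothed time per output token, seconds
    """
    local_prefill_time = prompt_len / device_prefill_rate  # seconds of local work
    return math.ceil(local_prefill_time / tpot)            # tokens needed to cover it
```

For example, a 2048-token prompt prefilled locally at 512 tokens/s takes 4 s, so at a 40 ms TPOT the cloud would need to stream about 100 tokens before the device can take over.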
Yibo Jin
State Key Lab. for Novel Software Technol., Nanjing Univ., Nanjing, China
Distributed System, Machine Learning
Yixu Xu
Huawei Technologies Co., Ltd.
Yue Chen
Huawei Technologies Co., Ltd.
Chengbin Wang
Huawei Technologies Co., Ltd.
Tao Wang
Huawei Technologies Co., Ltd.
Jiaqi Huang
University of Central Missouri
Cybersecurity, IoV
Rongfei Zhang
Huawei Technologies Co., Ltd.
Yiming Dong
Qwen Team, Alibaba Group & Peking University
Machine Learning, Optimization Methods
Yuting Yan
Nanjing University
Edge Intelligence, AI System, Video Analytics System
Ke Cheng
Xidian University
Secure Multi-Party Computation
Yingjie Zhu
Harbin Institute of Technology, Shenzhen
Natural Language Processing, Vision-Language Models, Large Language Models, Fact Checking
Shulan Wang
Huawei Technologies Co., Ltd.
Qianqian Tang
Huawei Technologies Co., Ltd.
Shuaishuai Meng
Huawei Technologies Co., Ltd.
Guanxin Cheng
Huawei Technologies Co., Ltd.
Ze Wang
Huawei Technologies Co., Ltd.
Shuyan Miao
Huawei Technologies Co., Ltd.
Ketao Wang
Huawei Technologies Co., Ltd.
Wen Liu
Huawei Technologies Co., Ltd.
Yifan Yang
Huawei Technologies Co., Ltd.
Tong Zhang
Huawei Technologies Co., Ltd.
Anran Wang
Huawei Technologies Co., Ltd.
Chengzhou Lu
Huawei Technologies Co., Ltd.
Tiantian Dong
Huawei Technologies Co., Ltd.
Yongsheng Zhang
Huawei Technologies Co., Ltd.