NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the tight coupling among attention computation, MoE communication, and KV cache placement in existing Mixture-of-Experts (MoE) serving systems. Because each request is bound to a single instance, these systems cannot balance KV cache occupancy and batch size at the same time, which produces expert-parallel stragglers and raises tail latency for long-context requests. The paper proposes Dynamic Context Parallelism (DCP), which sets the context-parallel degree per request: long requests are spread across instances to share the KV cache load, while short requests stay local to avoid communication. Combined with an ahead-of-time (AOT) graph engine and a routing-based communication backend on a hybrid data-expert parallel architecture, the system sustains up to 1.88–3.27× higher request rates under strict time-per-output-token (TPOT) service-level objectives and reduces P99 tail latency by up to 1.79–2.12×, substantially mitigating stragglers.
📝 Abstract
Modern serving systems for Mixture-of-Experts (MoE) models adopt hybrid data-expert parallelism: expert parallelism (EP) shards experts across GPUs to scale capacity, while data parallelism (DP) replicates attention layers across instances to process independent requests. Existing systems bind each request's attention, MoE communication, and KV cache to a single instance. Because attention latency scales with KV cache size while MoE communication latency scales with batch size, this binding cannot balance both simultaneously, producing EP stragglers; it also fragments KV memory across instances, inflating tail latency under long contexts. While existing context parallelism (CP) mitigates these constraints, its uniform parallelism degree incurs prohibitive communication and attention-side overheads. We present NanoCP, which decouples MoE communication from KV cache placement and achieves dual balance through dynamic context parallelism (DCP). DCP assigns each request a context-parallel degree sized to its KV footprint: long requests distribute attention across multiple instances; short requests remain local. This dynamic parallelism effectively liquefies the KV cache across the cluster, balancing both the per-instance KV cache occupancy and batch sizes without unnecessary load-balancing costs. To bridge DCP with static execution, NanoCP introduces an ahead-of-time (AOT) graph engine paired with a custom routing-based communication backend. Experimental results show that NanoCP maintains up to $1.88\times$--$3.27\times$ higher request rates under strict time-per-output-token (TPOT) service level objectives (SLOs). Furthermore, NanoCP significantly mitigates stragglers, reducing P99 tail latency by up to $1.79\times$--$2.12\times$.
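To make the idea of a per-request context-parallel degree concrete, here is a minimal Python sketch. It is not the paper's scheduler: the names `cp_degree`, `place`, and `shard_tokens`, the ceiling-based sizing rule, and the greedy least-loaded placement are all assumptions chosen for illustration, and the sketch balances only KV occupancy (batch-size balancing and communication cost are left out).

```python
import math


def cp_degree(kv_len: int, shard_tokens: int, max_degree: int) -> int:
    """Hypothetical sizing rule: one instance per `shard_tokens` of KV cache,
    clamped to [1, max_degree]. Short requests therefore get degree 1."""
    return max(1, min(max_degree, math.ceil(kv_len / shard_tokens)))


def place(requests: dict[str, int], num_instances: int, shard_tokens: int):
    """Greedily spread each request's KV cache over the least-loaded instances.

    requests maps a request id to its KV length in tokens.
    Returns (plan, loads): plan maps request id -> list of instance indices,
    loads is the resulting KV token count per instance.
    """
    loads = [0] * num_instances
    plan: dict[str, list[int]] = {}
    # Place the largest requests first so small ones can fill the gaps.
    for rid, kv_len in sorted(requests.items(), key=lambda kv: -kv[1]):
        degree = cp_degree(kv_len, shard_tokens, num_instances)
        # Pick the `degree` instances holding the fewest KV tokens (ties by index).
        targets = sorted(range(num_instances), key=lambda i: (loads[i], i))[:degree]
        # Split the KV tokens as evenly as possible across the chosen instances.
        base, extra = divmod(kv_len, degree)
        for j, inst in enumerate(targets):
            loads[inst] += base + (1 if j < extra else 0)
        plan[rid] = targets
    return plan, loads


if __name__ == "__main__":
    plan, loads = place({"long": 9000, "short-1": 500, "short-2": 300},
                        num_instances=4, shard_tokens=4096)
    print(plan)
    print(loads)
```

In this toy setup the long request is split across three instances while both short requests stay on a single instance, which mirrors the long-distributed, short-local behavior described above. A uniform CP degree would instead split every request across all instances, including the short ones that gain nothing from it.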
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
context parallelism
KV cache fragmentation
tail latency
expert parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Context Parallelism
Mixture-of-Experts
KV Cache Management
Request-Level Scheduling
Tail Latency Reduction
Jiefei Chen
Fudan University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Binbin Lin
Huazhong University of Science and Technology, Wuhan, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Jinming Ma
University of Science and Technology of China
reinforcement learning
Jiangfei Duan
The Chinese University of Hong Kong
Machine Learning Systems
Haojie Duanmu
Shanghai Jiao Tong University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Hao Liu
Shanghai Artificial Intelligence Laboratory, Shanghai, China; Harbin Institute of Technology, Harbin, China
Qinxiu Cheng
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Xiuhong Li
Infinigence-AI
Deep Learning System, GPGPU, Deep Learning Compiler
Zhilin Pei
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Hui Wang
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Xingcheng Zhang
Shanghai Artificial Intelligence Laboratory, Shanghai, China; SenseTime, Shanghai, China
Dahua Lin
The Chinese University of Hong Kong
computer vision, machine learning, probabilistic inference, bayesian nonparametrics