FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of simultaneously achieving high throughput, low latency, and long-context support in large language model (LLM) serving under dynamic traffic and heterogeneous request patterns. Existing systems suffer from inflexibility due to static parallelism strategies. Building on vLLM, this study presents the first runtime, restart-free switching mechanism between data parallelism (DP) and tensor parallelism (TP), dynamically adapting the parallel configuration to real-time workload demands through state virtualization. Key enablers include zero-copy weight management, KV cache adaptation, pre-initialized communication pools, and a deadlock-free scheduler, which collectively ensure state consistency and efficient reconfiguration. Experiments across three mainstream models and realistic scenarios demonstrate up to 4.79× and 3.47× performance improvements under high and low loads, respectively, while effectively coordinating bursty traffic handling, priority-based scheduling, and long-context inference.
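The "zero-copy weight management" idea, exposing TP shard views over a single resident copy of the weights rather than materializing per-rank copies, can be illustrated with array views. This is a minimal sketch using NumPy slicing as a stand-in for GPU tensor views; the function name and layout are hypothetical, not the paper's actual implementation:

```python
import numpy as np

def tp_shard_views(weight: np.ndarray, tp_degree: int):
    """Return column-sharded views of a weight matrix for tensor parallelism.

    Each view aliases the original buffer, so no weights are copied:
    switching between DP (full weight) and TP (per-rank shards) only
    changes which view each worker reads.
    """
    cols = weight.shape[1]
    assert cols % tp_degree == 0, "weight must divide evenly across ranks"
    shard = cols // tp_degree
    return [weight[:, r * shard:(r + 1) * shard] for r in range(tp_degree)]

# Toy weight matrix standing in for a model layer.
w = np.arange(24, dtype=np.float32).reshape(4, 6)
views = tp_shard_views(w, tp_degree=2)

# Every shard aliases the same memory as the full weight (zero-copy).
assert all(np.shares_memory(v, w) for v in views)
```

Because the shards are views, a DP-to-TP switch needs no weight movement; each rank simply starts addressing its slice of the already-resident tensor.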

📝 Abstract
Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting latency- and memory-driven requests.
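Components (iii) and (iv) of the abstract, the eagerly initialized communicator pool and the deadlock-free transition, can be sketched as a tiny controller: communicators for every candidate TP degree are built up front so a switch never pays collective-setup latency, and the switch itself waits for in-flight batches to drain before flipping. All class, method, and field names below are illustrative stand-ins, not vLLM's or Flying Serving's API, and plain dicts replace real NCCL process groups:

```python
import threading

class ParallelismController:
    """Toy sketch of an eager communicator pool plus drain-before-switch."""

    def __init__(self, candidate_tp_degrees):
        # (iii) Pre-initialize one "communicator" per candidate TP degree
        # at startup, amortizing collective setup across all future switches.
        self.comm_pool = {d: {"tp_degree": d, "ready": True}
                          for d in candidate_tp_degrees}
        self.active_tp = candidate_tp_degrees[0]
        self.inflight = 0
        self.cond = threading.Condition()

    def begin_batch(self):
        with self.cond:
            self.inflight += 1

    def end_batch(self):
        with self.cond:
            self.inflight -= 1
            self.cond.notify_all()

    def switch(self, new_tp):
        # (iv) Transition only at a quiescent point: wait for in-flight
        # batches to drain (the real scheduler must also stop admitting
        # new work here, or this wait could stall), then flip to a
        # communicator that already exists in the pool.
        with self.cond:
            while self.inflight > 0:
                self.cond.wait()
            assert new_tp in self.comm_pool, "degree must be pre-initialized"
            self.active_tp = new_tp
            return self.comm_pool[new_tp]

ctrl = ParallelismController([1, 2, 4])
comm = ctrl.switch(4)  # restart-free: no process group is created here
```

The deadlock-freedom claim in the paper concerns coordinating this drain under execution skew across workers; the sketch shows only the single-process shape of the handoff.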
Problem

Research questions and friction points this paper is trying to address.

large language model serving
parallelism switching
non-stationary traffic
mixed request requirements
dynamic reconfiguration
Innovation

Methods, ideas, or system contributions that make the work stand out.

parallelism switching
LLM serving
zero-copy model weights
KV cache adaptation
online reconfiguration