🤖 AI Summary
To address the challenge of simultaneously achieving low latency and high throughput for large language model (LLM) inference under dynamic workloads, this paper proposes an inference framework that dynamically switches between tensor parallelism (TP) and sequence parallelism (SP). It is the first work to bring SP into the inference phase, leveraging its KV-cache invariance and its complementary strengths relative to TP. A lightweight GPU communication mechanism keeps the KV cache consistent across modes and enables efficient dynamic scheduling. The framework shifts to TP under low-traffic conditions to minimize latency and to SP under high-traffic conditions to sustain throughput. Experiments demonstrate that, under interactive workloads, the framework reduces first-token latency by 34% (a 1.51× faster response) and increases batch throughput by 50%, significantly improving the latency-throughput trade-off.
📝 Abstract
Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency; however, its GPU communication reduces combined token throughput. Data parallelism (DP), on the other hand, achieves higher throughput yet suffers from slow response latency. A best-of-both-worlds solution does not exist: TP and DP cannot be combined, because the KV-cache layout varies across the two parallelisms.
We observe that Sequence Parallelism (SP; Ulysses, originally developed for training) has properties similar to DP, but with KV-cache invariance. We adapt SP to inference and combine it with TP to get the best of both worlds. Our solution: Shift Parallelism.
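To make the KV-cache invariance concrete, here is a minimal shape-level sketch (our illustration under assumed toy dimensions, not the paper's code): in Ulysses-style SP, an all-to-all before attention trades the sequence axis for the head axis, so attention, and therefore the KV cache it writes, ends up sharded by heads exactly as under TP.

```python
# Shape-level sketch of KV-cache invariance (all dimensions are assumed toy values).
world_size, n_heads, head_dim = 4, 32, 128
batch, seq_len = 1, 4096

# TP: each GPU runs attention for all tokens of n_heads / world_size heads,
# so its KV-cache shard is partitioned along the head axis.
tp_kv_shard = (batch, n_heads // world_size, seq_len, head_dim)

# Ulysses-style SP: each GPU starts with all heads for seq_len / world_size tokens...
sp_activation_shard = (batch, n_heads, seq_len // world_size, head_dim)

# ...but an all-to-all before attention exchanges the sequence axis for the
# head axis, so attention (and its KV cache) is head-partitioned as well.
sp_kv_shard = (batch, n_heads // world_size, seq_len, head_dim)

# Identical layouts: switching between TP and SP needs no KV-cache migration.
assert tp_kv_shard == sp_kv_shard
```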
Shift Parallelism dynamically switches between TP and SP, minimizing latency in low traffic without losing throughput in high traffic. The efficient GPU communication of Shift Parallelism yields up to i) 1.51× faster responses in interactive workloads and ii) 50% higher throughput in batch workloads, compared to a TP-only solution.
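As a hedged illustration of the switching logic (the function names and the queue-depth threshold below are our assumptions, not the paper's scheduler interface), a scheduler could shift to TP when the request queue is shallow, so each request gets all GPUs for the lowest latency, and to SP when the queue is deep, so throughput is sustained:

```python
# Illustrative policy sketch, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class EngineState:
    mode: str = "TP"  # current parallelism mode

def maybe_shift(engine: EngineState, queued_requests: int, high_watermark: int = 16) -> None:
    """Pick the parallelism mode for the next scheduling step.

    `high_watermark` is a hypothetical tuning knob. Because the KV cache has
    the same head-sharded layout under both modes (see the shape sketch above),
    the shift requires no cache migration.
    """
    target = "SP" if queued_requests >= high_watermark else "TP"
    if target != engine.mode:
        engine.mode = target  # reconfigure collectives; the KV cache stays in place
```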
We evaluate Shift Parallelism on real-world production traces with dynamic traffic patterns, as well as on synthetic benchmarks across models, context sizes, and arrival rates. All results affirm the same conclusion: Shift Parallelism has a better latency vs. throughput tradeoff than TP or DP, and hence achieves low latency without degrading throughput in dynamic workloads.