Seesaw: High-throughput LLM Inference via Model Re-sharding

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In distributed large language model (LLM) inference, the large computational disparity between the prefill and decode phases renders static parallelization strategies suboptimal for throughput. To address this, we propose a throughput-oriented dynamic model re-sharding mechanism. Our key contributions are: (1) cross-phase dynamic tensor re-sharding, enabling real-time adaptation of compute load and communication overhead; (2) tiered KV-cache buffering with transition-minimizing scheduling to reduce reconfiguration overhead; and (3) a batch-aware runtime reconfiguration framework. Under realistic workloads, our approach achieves up to 1.78× (1.36× on average) higher throughput than the state-of-the-art vLLM engine, significantly improving phase-adaptive parallel efficiency.

📝 Abstract
To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the distinct computational characteristics of the two stages of LLM inference, prefill and decoding, render a single static parallelization strategy insufficient to optimize both stages effectively. In this work, we present Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The key idea behind Seesaw is dynamic model re-sharding, a technique that enables the parallelization strategy to be reconfigured dynamically across stages, thereby maximizing throughput in both phases. To mitigate re-sharding overhead and preserve computational efficiency, we employ tiered KV cache buffering and transition-minimizing scheduling. Together, these techniques reduce the overhead of frequent stage transitions while maintaining maximum batching efficiency. Our evaluation shows that Seesaw achieves up to 1.78x higher throughput (1.36x on average) than vLLM, a widely used state-of-the-art LLM inference engine.
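The core mechanism, dynamic model re-sharding, switches the tensor-parallel layout of the weights between the prefill and decode phases. The following is a minimal NumPy sketch of that idea, not the paper's implementation: the `shard`/`reshard` helpers and the specific TP degrees are illustrative assumptions, and a real engine would move shards across devices rather than concatenate in host memory.

```python
import numpy as np

def shard(weight, tp):
    # Column-wise tensor-parallel sharding: each of `tp` ranks holds one slice.
    return np.split(weight, tp, axis=1)

def reshard(shards, new_tp):
    # Reassemble the full weight, then re-split for the new TP degree.
    # (In practice this would be an all-gather followed by a re-scatter.)
    full = np.concatenate(shards, axis=1)
    return np.split(full, new_tp, axis=1)

weight = np.arange(24, dtype=np.float32).reshape(4, 6)

# Prefill: higher TP degree to parallelize the compute-bound pass (assumed tp=3).
prefill_shards = shard(weight, tp=3)

# Decode: lower TP degree to cut communication in the memory-bound pass (assumed tp=2).
decode_shards = reshard(prefill_shards, new_tp=2)

# Re-sharding must be lossless: the reassembled weight matches the original.
assert np.array_equal(np.concatenate(decode_shards, axis=1), weight)
```

The sketch only shows that the layout change is lossless; Seesaw's contribution is making the transition cheap enough (via tiered KV cache buffering and transition-minimizing scheduling) to pay off at every phase switch.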
Problem

Research questions and friction points this paper is trying to address.

Improving the efficiency of distributed LLM inference
Static parallelization strategies are inefficient across the prefill and decode stages
Raising throughput via dynamic model re-sharding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic model re-sharding for LLM inference
Tiered KV cache buffering reduces overhead
Transition-minimizing scheduling enhances batching efficiency
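Tiered KV cache buffering keeps hot cache entries in a small fast tier and spills the rest to a larger slow tier so that re-sharding does not force cache recomputation. A minimal sketch of a two-tier buffer follows; the class name, the LRU spill policy, and the Python-dict "tiers" are all assumptions for illustration, since the paper's actual tiering and eviction policy is not reproduced here.

```python
from collections import OrderedDict

class TieredKVCache:
    """Two-tier KV-cache buffer sketch: a bounded fast tier (think GPU memory)
    spills least-recently-used entries to a slow tier (think host memory)."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()  # insertion order tracks recency
        self.slow = {}
        self.fast_capacity = fast_capacity

    def put(self, seq_id, kv):
        # Insert (or refresh) in the fast tier, evicting LRU entries on overflow.
        self.fast[seq_id] = kv
        self.fast.move_to_end(seq_id)
        while len(self.fast) > self.fast_capacity:
            victim, victim_kv = self.fast.popitem(last=False)
            self.slow[victim] = victim_kv  # spill instead of recomputing

    def get(self, seq_id):
        if seq_id in self.fast:
            self.fast.move_to_end(seq_id)
            return self.fast[seq_id]
        # Miss in the fast tier: promote from the slow tier.
        kv = self.slow.pop(seq_id)
        self.put(seq_id, kv)
        return kv
```

Usage: sequences whose KV cache is spilled during a phase transition are promoted back on access rather than re-prefilled, which is the overhead the paper's buffering is designed to avoid.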