🤖 AI Summary
Dynamic sequence parallelism (SP) allocation in large language model (LLM) online serving is rigid: it cannot adapt to stage-specific parallelism requirements across heterogeneous request lengths, excessive SP allocation degrades global latency, and resource fragments arising from SP size variation go unused. To address this, the paper proposes Chunkwise Dynamic Sequence Parallelism (CDSP), a fine-grained strategy that assigns SP sizes to intra-request token segments (chunks) rather than to whole requests. Built on CDSP, the Tetris serving system integrates chunk-level SP into a disaggregated cluster to satisfy parallelism heterogeneity, regulates SP size expansion according to real-time load, and adaptively explores chunking plans to reuse fragmented resources while meeting per-request demands. Under maximum sustainable load, Tetris achieves up to 4.35× lower time-to-first-token latency, reduces median time-between-tokens by up to 40.1%, and increases maximum request capacity by up to 45% over state-of-the-art systems.
📝 Abstract
With the advancement of large language models (LLMs), their context windows have rapidly expanded. To meet the diverse demands of varying-length requests in online services, existing state-of-the-art systems tune the sequence parallelism (SP) allocation. However, current dynamic SP allocation lacks the flexibility to (1) support stage-specific parallelism requirements in LLM inference, (2) mitigate the global latency degradation caused by excessive SP allocation, and (3) exploit resource fragments arising from SP size variation. To tackle this problem, we propose Chunkwise Dynamic Sequence Parallelism (CDSP), a fine-grained parallelism strategy that assigns SP sizes across *intra-request* token segments. Based on CDSP, we build Tetris, an LLM serving system that (1) efficiently integrates CDSP into a disaggregated cluster to satisfy parallelism heterogeneity, (2) dynamically regulates SP size expansion based on real-time load conditions, and (3) adaptively explores chunking plans to utilize fragmented resources while meeting per-request demands. Compared with state-of-the-art systems, Tetris achieves up to 4.35× lower time-to-first-token (TTFT) under maximum sustainable loads, reduces median time-between-tokens (TBT) by up to 40.1%, and increases the maximum request capacity by up to 45%.