FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses head-of-line blocking caused by long requests during the prefill phase of large language model serving: such requests inflate time-to-first-token (TTFT) latency and drive service-level objective (SLO) violations for high-priority requests. To mitigate this, the authors propose an adaptive prefill scheduling mechanism that decouples preemption granularity from scheduling frequency through operator-level preemption and event-driven scheduling. This approach enables fine-grained interruption and low-overhead scheduling while preserving computational efficiency. Dynamic prefill chunking is further integrated to jointly optimize system responsiveness and throughput. Experimental evaluation under real-world production workloads shows that the proposed method achieves up to 5.6× higher goodput than state-of-the-art systems while meeting heterogeneous SLO requirements.
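The summary's operator-level preemption can be pictured as running a prefill pass one operator at a time and checking a preemption signal only at operator boundaries, so pausing never splits an operator. A minimal sketch of that idea, assuming a prefill pass modeled as a list of operators and a caller-supplied preemption predicate (all names here are illustrative, not from the paper):

```python
def run_prefill(operators, state, should_preempt):
    """Execute operators in order starting from state['next_op'].

    At each operator boundary, consult should_preempt(); if it returns
    True, pause and record the resume point in state. Returns True when
    the whole prefill pass has finished, False if it was preempted.
    """
    while state["next_op"] < len(operators):
        if should_preempt():
            # Paused at an operator boundary; state holds the resume point.
            return False
        op = operators[state["next_op"]]
        state["x"] = op(state["x"])   # run one operator to completion
        state["next_op"] += 1
    return True
```

A preempted request can later be resumed by calling `run_prefill` again with the same `state`, so no per-operator work is ever wasted, unlike restarting a fixed-size chunk.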

📝 Abstract
The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunking; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6× compared to state-of-the-art systems while satisfying heterogeneous SLOs.
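The abstract's event-driven scheduling idea, where scheduling decisions fire only on request arrival or completion rather than on every chunk or step, can be sketched roughly as follows. Ordering requests by TTFT deadline and every identifier below are assumptions made for illustration, not the paper's implementation:

```python
import heapq

class EventDrivenScheduler:
    """Toy sketch: schedule only on arrival/completion events.

    Pending requests sit in a min-heap keyed by TTFT deadline; an
    arriving request that is more urgent than the one currently in
    prefill preempts it (the preempted request keeps its place in the
    heap and resumes later).
    """

    def __init__(self):
        self.queue = []      # min-heap of (ttft_deadline, request_id)
        self.running = None  # (deadline, id) of the request in prefill

    def on_arrival(self, rid, deadline):
        heapq.heappush(self.queue, (deadline, rid))
        # Preempt only if the newcomer is more urgent than the runner.
        if self.running is not None and deadline < self.running[0]:
            heapq.heappush(self.queue, self.running)
            self.running = None
        return self._dispatch()

    def on_completion(self):
        self.running = None
        return self._dispatch()

    def _dispatch(self):
        if self.running is None and self.queue:
            self.running = heapq.heappop(self.queue)
        return self.running[1] if self.running else None
```

Because the control plane is touched only at these two event types, scheduling overhead stays independent of how finely execution can be interrupted, which is the decoupling the paper's title refers to.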
Problem

Research questions and friction points this paper is trying to address.

head-of-line blocking
prefill scheduling
large language model serving
time-to-first-token
service level objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Operator-Level Preemption
Event-Driven Scheduling
Head-of-Line Blocking
Chunked Prefill
LLM Serving
Chia-chi Hsieh
Tsinghua University
Zan Zong
University of Science and Technology Beijing
Xinyang Chen
Associate Professor, Harbin Institute of Technology (Shenzhen)
machine learning, multimodal learning, transfer learning
Jianjiang Li
University of Science and Technology Beijing
Jidong Zhai
Tsinghua University
Parallel Computing, Compiler, Programming Model, GPU
Lijie Wen
Tsinghua University