TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high first-token latency and low resource utilization in real-time streaming generation with large language models (LLMs), this paper proposes a preemptive scheduling and proactive KV cache management framework tailored to request-burst scenarios. The method introduces: (1) a dynamic priority mechanism based on token buffer occupancy and consumption rate, enabling fine-grained preemption; and (2) GPU-CPU collaborative KV cache pre-migration, asynchronous frontend-backend memory exchange, and computation-I/O overlap to reduce context-switching overhead. Experiments on multi-GPU platforms demonstrate that the approach improves effective throughput by up to 82.5%, reduces P99 first-token latency by up to 80.2%, and maintains end-to-end generation throughput without degradation.

📝 Abstract
Real-time LLM interactions demand streamed token generation, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation (i.e., required time-between-tokens). Standard LLM serving systems suffer from the inflexibility caused by non-preemptive request scheduling and reactive memory management, leading to poor resource utilization and low request processing parallelism under request bursts. Therefore, we present TokenFlow, a novel LLM serving system with enhanced text streaming performance via preemptive request scheduling and proactive key-value (KV) cache management. TokenFlow dynamically prioritizes requests based on real-time token buffer occupancy and token consumption rate, while actively transferring KV cache between GPU and CPU memory in the background and overlapping I/O with computation to minimize request preemption overhead. Extensive experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200) demonstrate that TokenFlow achieves up to 82.5% higher effective throughput (accounting for actual user consumption) while reducing P99 TTFT by up to 80.2%, without degrading overall token throughput.
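The abstract's computation-I/O overlap idea can be illustrated with a minimal sketch: while one decode step runs, the preemption victim's KV blocks are copied to host memory on a background thread. This is not the paper's implementation (which would use asynchronous GPU copies on separate CUDA streams); `compute_step` and `copy_to_cpu` are placeholder callables for illustration.

```python
import threading
import queue

def overlap_migrate(kv_blocks, compute_step, copy_to_cpu):
    """Overlap KV-cache migration with a decode step.

    A background thread copies the preempted request's KV blocks to
    host memory while the current batch's decode step runs, so the
    context switch adds little extra wall-clock time.
    """
    done = queue.Queue()

    def migrate():
        for blk in kv_blocks:
            copy_to_cpu(blk)       # would be an async device-to-host copy
        done.put("migrated")

    t = threading.Thread(target=migrate)
    t.start()
    result = compute_step()        # decode proceeds concurrently
    t.join()                       # migration finished by the step boundary
    assert done.get() == "migrated"
    return result
```

In a real serving system the copy would target pinned host buffers and run on a dedicated transfer stream, so it genuinely overlaps with GPU kernels rather than with Python-level work.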
Problem

Research questions and friction points this paper is trying to address.

Optimizes LLM streaming for low first-token latency
Improves resource utilization during sudden request bursts
Manages KV cache proactively to reduce preemption overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preemptive request scheduling for dynamic prioritization
Proactive KV cache management between GPU and CPU
Overlapping I/O with computation to minimize overhead
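The buffer-occupancy-based prioritization above can be sketched as a scoring function: a request whose client-side token buffer holds many seconds of unread output can safely yield the GPU, while waiting requests are most urgent during a burst. The names (`StreamState`, `preemption_priority`) and the exact scoring rule are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass

@dataclass
class StreamState:
    buffered_tokens: int   # tokens generated but not yet consumed by the user
    consume_rate: float    # tokens/sec the user's frontend is draining
    waiting: bool          # request has not produced its first token yet

def preemption_priority(s: StreamState) -> float:
    """Lower score = more urgent; the scheduler preempts the highest score.

    Score is the seconds of playback already buffered: the more slack
    a stream has, the safer it is to pause it in favor of a waiting
    request, cutting time-to-first-token under a burst.
    """
    if s.waiting:
        return 0.0                 # unserved requests come first
    if s.consume_rate <= 0:
        return float("inf")        # nobody is draining; safest to preempt
    return s.buffered_tokens / s.consume_rate
```

For example, a stream with 100 buffered tokens drained at 10 tokens/sec scores 10 (seconds of slack) and would be preempted before one with only 5 tokens buffered.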
👥 Authors
Junyi Chen — Shanghai Jiao Tong University (Generative AI, Multimodal Learning)
Chuheng Du — Shanghai Jiao Tong University, Shanghai, China
Renyuan Liu — Guangzhou University (Robotics, Computational Neuroscience, Data Structures and Algorithms)
Shuochao Yao — Assistant Professor of Computer Science, George Mason University
Dingtian Yan — China Telecom Corporation Limited, Shanghai Branch, Shanghai, China
Jiang Liao — China Telecom Corporation Limited, Shanghai Branch, Shanghai, China
Shengzhong Liu — Shanghai Jiao Tong University
Fan Wu — Shanghai Jiao Tong University, Shanghai, China
Guihai Chen — Professor of Computer Science, Computer Science and Technology