BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems

📅 2024-01-31
📈 Citations: 10
Influential: 1
🤖 AI Summary
Existing LLM serving systems are often evaluated under unrealistic workload assumptions due to the lack of real-world trace data, leading to severe discrepancies between expected and deployed QoS and throughput. To address this, we introduce BurstGPT, the first large-scale, open-source, real-world LLM serving workload dataset, comprising 10.31 million request traces collected over 213 days from regional Azure OpenAI GPT services. Its key contributions include the first systematic characterization of user concurrency bursts, multi-granularity dialogue temporal patterns, KV-cache pressure evolution, and service failure modes. Leveraging BurstGPT, we empirically demonstrate significant degradation in the efficiency and stability of state-of-the-art KV-cache management, request scheduling, and disaggregation strategies under realistic workloads. The dataset is publicly released and has been widely adopted in industry for benchmarking and optimizing LLM serving systems.

📝 Abstract
Serving systems for Large Language Models (LLMs) are often optimized to improve quality of service (QoS) and throughput. However, due to the lack of open-source LLM serving workloads, these systems are frequently evaluated under unrealistic workload assumptions. Consequently, performance may degrade when systems are deployed in real-world scenarios. This work presents BurstGPT, an LLM serving workload with 10.31 million traces from regional Azure OpenAI GPT services over 213 days. BurstGPT captures LLM serving characteristics from user, model, and system perspectives: (1) User request concurrency: burstiness variations of requests in Azure OpenAI GPT services, revealing diversified concurrency patterns across different services and model types. (2) User conversation patterns: counts and intervals within conversations for service optimizations. (3) Model response lengths: auto-regressive serving processes of GPT models, showing statistical relations between requests and their responses. (4) System response failures: failures of conversation and API services, showing the intensive resource needs and limited availability of LLM services in Azure. These characteristics can serve multiple purposes in LLM serving optimization, such as system evaluation and trace provisioning. In our demo evaluation, the frequent workload variations in BurstGPT reveal declines in efficiency, stability, or reliability in realistic LLM serving. We identify that the generalization of KV cache management, scheduling, and disaggregation optimizations can be improved under realistic workload evaluations. BurstGPT is publicly available at https://github.com/HPMLL/BurstGPT and is widely used in industry to develop prototypes of LLM serving frameworks.
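The concurrency burstiness described in point (1) can be probed directly once a trace is loaded. The sketch below is a minimal, hypothetical example: it assumes a per-request `Timestamp` column in seconds (the actual column names should be checked against the released CSVs on GitHub) and uses the coefficient of variation of per-minute request counts as a simple burstiness proxy:

```python
# Hypothetical sketch for analyzing a BurstGPT-style trace.
# The "Timestamp" column name is an assumption; verify against the
# actual CSV schema at github.com/HPMLL/BurstGPT.
import pandas as pd

def burstiness(trace: pd.DataFrame, window: str = "60s") -> float:
    """Coefficient of variation of per-window request counts.

    A value near 0 means a steady arrival rate; larger values
    indicate burstier traffic.
    """
    t = pd.to_datetime(trace["Timestamp"], unit="s")
    counts = trace.set_index(t).resample(window).size()
    return counts.std() / counts.mean()

# Tiny synthetic trace: 100 requests in the first minute, 10 in the next.
demo = pd.DataFrame(
    {"Timestamp": list(range(0, 50)) * 2 + list(range(60, 70))}
)
print(f"burstiness = {burstiness(demo):.2f}")  # ≈ 1.16 for this bursty demo
```

The same per-window resampling can be reused at other granularities (e.g. `window="1s"` or `"10min"`) to reproduce the multi-granularity concurrency analysis the paper describes.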
Problem

Research questions and friction points this paper is trying to address.

Lack of open-source LLM serving workloads for realistic evaluation
Performance degradation in real-world LLM serving scenarios
Need for KV-cache management, scheduling, and disaggregation optimizations that generalize to realistic workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

BurstGPT dataset with 10.31M real-world LLM traces
Captures user, model, and system serving characteristics
Reveals generalization gaps in KV-cache management, scheduling, and disaggregation optimizations under realistic workloads
Yuxin Wang
Hong Kong Baptist University
Yuhan Chen
The Hong Kong University of Science and Technology (Guangzhou)
Zeyu Li
The Hong Kong University of Science and Technology (Guangzhou)
Xueze Kang
HKUST(GZ)
HPC
Zhenheng Tang
The Hong Kong University of Science and Technology
Machine Learning · ML Systems · Large Language Model · Personal AI
Xin He
National University of Singapore
Rui Guo
Tsinghua University
Xin Wang
Tsinghua University
Qiang Wang
Harbin Institute of Technology, Shenzhen
Amelie Chi Zhou
Assistant Professor, HKBU, Hong Kong
High performance computing · Cloud computing · Big data analytics
Xiaowen Chu
IEEE Fellow, Professor, Data Science and Analytics, HKUST(GZ)
GPU Computing · Machine Learning Systems · Parallel and Distributed Computing · Wireless Networks