BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems

📅 2024-01-31
📈 Citations: 10
Influential: 1
🤖 AI Summary
Existing LLM serving systems are often evaluated under unrealistic workload assumptions due to the lack of real-world trace data, leading to severe discrepancies between expected and deployed QoS and throughput. To address this, we introduce BurstGPT, the first large-scale, open-source, real-world LLM serving workload dataset, comprising 10.31 million request traces collected over 213 days from regional Azure OpenAI GPT services. Its key contributions include the first systematic characterization of user concurrency bursts, multi-granularity dialogue temporal patterns, KV-cache pressure evolution, and service failure modes. Leveraging BurstGPT, we empirically demonstrate significant degradation in the efficiency and stability of state-of-the-art KV-cache management, request scheduling, and disaggregation strategies under realistic workloads. The dataset is publicly released and has been widely adopted in industry for benchmarking and optimizing LLM serving systems.

📝 Abstract
Serving systems for Large Language Models (LLMs) are often optimized to improve quality of service (QoS) and throughput. However, due to the lack of open-source LLM serving workloads, these systems are frequently evaluated under unrealistic workload assumptions. Consequently, performance may degrade when systems are deployed in real-world scenarios. This work presents BurstGPT, an LLM serving workload with 10.31 million traces from regional Azure OpenAI GPT services over 213 days. BurstGPT captures LLM serving characteristics from user, model, and system perspectives: (1) User request concurrency: burstiness variations of requests in Azure OpenAI GPT services, revealing diversified concurrency patterns across different services and model types. (2) User conversation patterns: counts and intervals within conversations for service optimizations. (3) Model response lengths: auto-regressive serving processes of GPT models, showing statistical relations between requests and their responses. (4) System response failures: failures of conversation and API services, showing the intensive resource needs and limited availability of LLM services in Azure. These characteristics can serve multiple purposes in LLM serving optimization, such as system evaluation and trace provisioning. In our demo evaluation, the frequent workload variations in BurstGPT reveal declines in efficiency, stability, or reliability in realistic LLM serving. We identify that the generalization of KV cache management, scheduling, and disaggregation optimizations can be improved under realistic workload evaluations. BurstGPT is publicly available at https://github.com/HPMLL/BurstGPT and is widely used in industry to develop prototypes of LLM serving frameworks.
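The concurrency burstiness described in point (1) can be probed directly once a trace is loaded. The sketch below is a minimal, hypothetical example: it assumes a per-request `Timestamp` column in seconds (the actual column names should be checked against the released CSVs on GitHub) and uses the coefficient of variation of per-minute request counts as a simple burstiness proxy:

```python
# Hypothetical sketch for analyzing a BurstGPT-style trace.
# The "Timestamp" column name is an assumption; verify against the
# actual CSV schema at github.com/HPMLL/BurstGPT.
import pandas as pd

def burstiness(trace: pd.DataFrame, window: str = "60s") -> float:
    """Coefficient of variation of per-window request counts.

    A value near 0 means a steady arrival rate; larger values
    indicate burstier traffic.
    """
    t = pd.to_datetime(trace["Timestamp"], unit="s")
    counts = trace.set_index(t).resample(window).size()
    return counts.std() / counts.mean()

# Tiny synthetic trace: 100 requests in the first minute, 10 in the next.
demo = pd.DataFrame(
    {"Timestamp": list(range(0, 50)) * 2 + list(range(60, 70))}
)
print(f"burstiness = {burstiness(demo):.2f}")  # ≈ 1.16 for this bursty demo
```

The same per-window resampling can be reused at other granularities (e.g. `window="1s"` or `"10min"`) to reproduce the multi-granularity concurrency analysis the paper describes.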
Problem

Research questions and friction points this paper is trying to address.

Lack of open-source LLM serving workloads for realistic evaluation
Performance degradation in real-world LLM serving scenarios
Need for KV-cache management, scheduling, and disaggregation optimizations that generalize to realistic workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

BurstGPT dataset with 10.31M real-world LLM traces
Captures user, model, and system serving characteristics
Reveals generalization gaps in KV-cache management, scheduling, and disaggregation optimizations under realistic workloads
Yuxin Wang
Hong Kong Baptist University
Yuhan Chen
The Hong Kong University of Science and Technology (Guangzhou)
Zeyu Li
The Hong Kong University of Science and Technology (Guangzhou)
Xueze Kang
HKUST(GZ)
HPC
Zhenheng Tang
The Hong Kong University of Science and Technology
Machine Learning · ML Systems · Large Language Model · Personal AI
Xin He
National University of Singapore
Rui Guo
Tsinghua University
Xin Wang
Tsinghua University
Qiang Wang
Harbin Institute of Technology, Shenzhen
Amelie Chi Zhou
Assistant Professor, HKBU, Hong Kong
High performance computing · Cloud computing · Big data analytics
Xiaowen Chu
IEEE Fellow, Professor, Data Science and Analytics, HKUST(GZ)
GPU Computing · Machine Learning Systems · Parallel and Distributed Computing · Wireless Networks