AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications

📅 2025-03-17
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses LLM inference serving under mixed-length prompts and heterogeneous, iteration-level service-level objective (SLO) constraints, aiming to boost throughput for long prompts while strictly guaranteeing diverse, application-specific SLOs. AccelGen introduces: (1) an SLO-guaranteed dynamic chunking mechanism; (2) iteration-level, SLO-driven task prioritization; and (3) a multi-resource-aware batching strategy that jointly optimizes GPU compute and KV cache utilization. In trace-driven experiments on real systems, AccelGen achieves 1.42–11.21× higher throughput, 1.43–13.71× higher goodput, 37–90 percentage-point gains in SLO attainment, and 1.61–12.22× lower tail latency than state-of-the-art baselines, approaching the Oracle upper bound. To the authors' knowledge, this is the first work to simultaneously achieve high throughput and strong SLO guarantees under heterogeneous iteration-level SLO constraints.
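
To make the chunking idea concrete, here is a minimal sketch of an SLO-guaranteed chunk-size search, assuming a latency predictor that grows monotonically with chunk size. The names (`choose_chunk_size`, `predict_iter_ms`) and the toy latency model are hypothetical illustrations, not AccelGen's actual API.

```python
def choose_chunk_size(pending_prompt_tokens: int,
                      tightest_slo_ms: float,
                      predict_iter_ms,
                      min_chunk: int = 16,
                      max_chunk: int = 2048) -> int:
    """Return the largest prefill chunk whose predicted iteration time
    still meets the tightest per-iteration SLO in the current batch.
    (Hypothetical sketch; not the paper's exact algorithm.)"""
    lo = min_chunk
    hi = min(max_chunk, pending_prompt_tokens)
    best = 0
    while lo <= hi:                  # binary search: latency grows with chunk size
        mid = (lo + hi) // 2
        if predict_iter_ms(mid) <= tightest_slo_ms:
            best, lo = mid, mid + 1  # feasible, try a larger chunk
        else:
            hi = mid - 1             # too slow, shrink the chunk
    return best

# Toy latency model: a fixed per-iteration cost plus a per-prefill-token cost.
iter_ms = lambda chunk: 4.0 + 0.01 * chunk
print(choose_chunk_size(8192, tightest_slo_ms=20.0, predict_iter_ms=iter_ms))  # -> 1600
```

Larger chunks raise GPU utilization, while the tightest SLO in the batch caps how many prefill tokens one iteration can absorb; dynamic chunking navigates exactly that trade-off.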

📝 Abstract
In this paper, we consider a mixed-prompt scenario for a large language model (LLM) inference serving system that supports diverse applications with both short and long prompts and heterogeneous SLOs on iteration time. To improve throughput when handling long prompts, previous research introduced chunking, but it does not address heterogeneous SLOs. To overcome this limitation, we propose AccelGen, a high-throughput LLM inference serving system with heterogeneous SLO guarantees for diverse applications. AccelGen introduces three core components: (1) SLO-guaranteed dynamic chunking, which dynamically adjusts chunk sizes to maximize GPU compute utilization while meeting iteration-level SLOs; (2) iteration-level SLO-based task prioritization, which prioritizes tight-SLO requests and batches requests with similar SLOs; and (3) multi-resource-aware batching, which selects queued requests to maximize the utilization of both GPU compute resources and the key-value cache (KVC). Trace-driven experiments on real systems demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to state-of-the-art approaches. Its performance approaches that of an Oracle that optimally maximizes goodput.
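
The prioritization component can be pictured as a deadline-ordered queue that also groups requests into SLO buckets, so that a single iteration budget fits the whole batch. The sketch below is one interpretation under that assumption; `Request`, `SLOScheduler`, and the 10 ms bucket width are invented for illustration, not taken from the paper.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    slo_ms: float                        # per-iteration SLO; tighter sorts first
    req_id: int = field(compare=False)
    prompt_tokens: int = field(compare=False)

class SLOScheduler:
    """Hypothetical sketch of iteration-level SLO-based prioritization:
    serve the tightest SLO first, and batch only requests whose SLOs fall
    in the same bucket so one iteration budget covers the whole batch."""

    def __init__(self, bucket_ms: float = 10.0):
        self.bucket_ms = bucket_ms
        self._heap: list[Request] = []

    def submit(self, req: Request) -> None:
        heapq.heappush(self._heap, req)

    def next_batch(self, max_batch: int) -> list[Request]:
        if not self._heap:
            return []
        batch = [heapq.heappop(self._heap)]
        bucket = int(batch[0].slo_ms // self.bucket_ms)
        while (self._heap and len(batch) < max_batch
               and int(self._heap[0].slo_ms // self.bucket_ms) == bucket):
            batch.append(heapq.heappop(self._heap))
        return batch

sched = SLOScheduler()
sched.submit(Request(50.0, 1, 3000))   # e.g. batch summarization, loose SLO
sched.submit(Request(15.0, 2, 200))    # e.g. interactive chat, tight SLO
sched.submit(Request(18.0, 3, 250))
print([r.req_id for r in sched.next_batch(8)])  # -> [2, 3]; request 1 waits
```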
Problem

Research questions and friction points this paper is trying to address.

How to serve a mix of short and long prompts within one LLM inference serving system.
How to guarantee heterogeneous, application-specific SLOs at the iteration level.
How to raise throughput and SLO attainment when long prompts monopolize GPU compute.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLO-guaranteed dynamic chunking adjusts chunk sizes to fill GPU compute while meeting iteration-level SLOs.
Iteration-level SLO-based prioritization schedules tight-SLO requests first and batches requests with similar SLOs.
Multi-resource-aware batching admits queued requests to maximize both GPU compute and KVC utilization (a sketch follows this list).
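
As a rough illustration of the batching policy, the sketch below greedily admits queued requests while two budgets hold: a per-iteration token budget (GPU compute) and a pool of free KV-cache blocks. The `Queued` fields and the greedy admission order are assumptions for illustration, not the paper's exact algorithm.

```python
from dataclasses import dataclass

@dataclass
class Queued:
    req_id: int
    chunk_tokens: int   # tokens this request adds to the iteration (GPU compute)
    kvc_blocks: int     # KV-cache blocks it would occupy (memory)

def select_batch(queue: list[Queued], token_budget: int, free_blocks: int) -> list[Queued]:
    """Greedy two-resource admission: keep adding requests while both the
    iteration token budget and the free KVC blocks allow. Illustrative
    only; AccelGen's actual policy may weight the two resources differently."""
    batch, tokens, blocks = [], 0, 0
    for req in queue:  # assume the queue is already ordered by the SLO scheduler
        if (tokens + req.chunk_tokens <= token_budget
                and blocks + req.kvc_blocks <= free_blocks):
            batch.append(req)
            tokens += req.chunk_tokens
            blocks += req.kvc_blocks
    return batch

picked = select_batch(
    [Queued(1, 512, 8), Queued(2, 1024, 4), Queued(3, 256, 24)],
    token_budget=2048, free_blocks=16)
print([r.req_id for r in picked])  # -> [1, 2]; request 3 would exhaust KVC blocks
```

Checking both budgets per admission is what distinguishes this from compute-only batching: a request that fits the token budget can still be rejected when it would overflow the KV cache, and vice versa.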