🤖 AI Summary
Existing serving systems deploy LLM inference and parameter-efficient fine-tuning (PEFT), e.g., LoRA, on isolated GPU clusters, leading to significant resource waste and low utilization. This paper introduces the first end-to-end co-serving framework that lets inference and PEFT-based fine-tuning coexist on shared GPUs. The approach combines static compilation optimizations (token-level compute fusion, dependent parallelization, and computation-graph pruning) with a GPU memory-aware runtime, a hybrid token scheduler, and dynamic batching. This joint design meets inference latency SLOs at up to 20 req/s while improving fine-tuning throughput by 1.9-6.8x. Experiments demonstrate up to 80% GPU memory reduction; under peak load on models such as LLaMA-3.1-8B, the system preserves over 76% of peak fine-tuning progress while maintaining inference responsiveness.
📝 Abstract
Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations -- dependent parallelization and graph pruning -- significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM sustains inference SLO requirements at up to 20 req/s, and improves finetuning throughput by 1.9-4.8x under heavy inference workloads and 2.5-6.8x under light loads, preserving over 76% of peak finetuning progress even at peak demand. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow/.
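The core scheduling idea can be illustrated with a minimal sketch. The policy below is an illustrative assumption, not FlexLLM's actual implementation: each co-serving iteration has a fixed token budget, latency-sensitive inference tokens are admitted first, and the leftover capacity is backfilled with finetuning tokens so the GPU stays busy even when inference load fluctuates.

```python
# Hypothetical sketch of hybrid token scheduling for co-serving.
# All names (Batch, schedule_iteration, token_budget) are illustrative
# assumptions; the real FlexLLM scheduler is more sophisticated.
from dataclasses import dataclass


@dataclass
class Batch:
    inference_tokens: int
    finetune_tokens: int


def schedule_iteration(pending_inference: int, pending_finetune: int,
                       token_budget: int) -> Batch:
    """Build one co-serving batch within a per-iteration token budget."""
    # Inference tokens are scheduled first so latency SLOs are met.
    inf = min(pending_inference, token_budget)
    # Remaining capacity is backfilled with finetuning tokens.
    ft = min(pending_finetune, token_budget - inf)
    return Batch(inference_tokens=inf, finetune_tokens=ft)


# Under heavy inference load finetuning still makes progress in the slack;
# under light load, finetuning tokens fill most of the budget.
heavy = schedule_iteration(pending_inference=900, pending_finetune=5000,
                           token_budget=1024)
light = schedule_iteration(pending_inference=100, pending_finetune=5000,
                           token_budget=1024)
```

In this toy model the GPU is never idle as long as finetuning work is queued, which mirrors the paper's claim that finetuning progress is preserved even at peak inference demand.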