🤖 AI Summary
Existing serving systems deploy LLM inference and parameter-efficient fine-tuning (PEFT), e.g., LoRA, on isolated GPU clusters, leading to significant resource waste and low utilization. This paper introduces the first end-to-end co-serving framework that lets inference and PEFT-based fine-tuning coexist on shared GPUs. The approach combines static compilation optimizations (token-level compute fusion, dependent parallelization, and computation-graph pruning) with a GPU memory-aware runtime, a hybrid token scheduler, and dynamic batching. This joint design meets inference latency SLOs at up to 20 req/s while improving fine-tuning throughput by 1.9-6.8x. Experiments demonstrate up to 80% GPU memory reduction; under peak load on models such as LLaMA-3.1-8B, the system preserves over 76% of peak fine-tuning progress while maintaining inference responsiveness.
📝 Abstract
Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations -- dependent parallelization and graph pruning -- significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM sustains inference SLO requirements at up to 20 req/s, and improves finetuning throughput by 1.9-4.8x under heavy inference workloads and 2.5-6.8x under light loads, preserving over 76% of peak finetuning progress even at peak demand. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow/.
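The core scheduling idea can be illustrated with a minimal sketch. The policy below is an illustrative assumption, not FlexLLM's actual implementation: each co-serving iteration has a fixed token budget, latency-sensitive inference tokens are admitted first, and the leftover capacity is backfilled with finetuning tokens so the GPU stays busy even when inference load fluctuates.

```python
# Hypothetical sketch of hybrid token scheduling for co-serving.
# All names (Batch, schedule_iteration, token_budget) are illustrative
# assumptions; the real FlexLLM scheduler is more sophisticated.
from dataclasses import dataclass


@dataclass
class Batch:
    inference_tokens: int
    finetune_tokens: int


def schedule_iteration(pending_inference: int, pending_finetune: int,
                       token_budget: int) -> Batch:
    """Build one co-serving batch within a per-iteration token budget."""
    # Inference tokens are scheduled first so latency SLOs are met.
    inf = min(pending_inference, token_budget)
    # Remaining capacity is backfilled with finetuning tokens.
    ft = min(pending_finetune, token_budget - inf)
    return Batch(inference_tokens=inf, finetune_tokens=ft)


# Under heavy inference load finetuning still makes progress in the slack;
# under light load, finetuning tokens fill most of the budget.
heavy = schedule_iteration(pending_inference=900, pending_finetune=5000,
                           token_budget=1024)
light = schedule_iteration(pending_inference=100, pending_finetune=5000,
                           token_budget=1024)
```

In this toy model the GPU is never idle as long as finetuning work is queued, which mirrors the paper's claim that finetuning progress is preserved even at peak inference demand.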