Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the system management challenges that arise when serving large foundation models, whose chain-structured jobs have heavy GPU memory footprints. It formulates the "server chain composition" problem, in which block placement and cache allocation jointly determine how servers are chained together under pipeline parallelism, and shows that solving it optimally is NP-hard. It then develops scalable algorithms with guaranteed performance under state-of-the-art load balancing. Applied to a distributed large language model (LLM) serving system, the proposed solution significantly reduces response times compared to state-of-the-art solutions.

📝 Abstract

As a current trend in Artificial Intelligence (AI), large foundation models are increasingly employed as the core of AI services. However, even after training, serving such models at scale remains a challenging task due to their heavy resource footprints, particularly in terms of GPU memory. While recent works revealed unique characteristics of systems serving foundation models that distinguish them from traditional distributed computing systems, there is still a lack of fundamental understanding of the underlying system management problems. This work aims at addressing this gap by extracting a novel problem of "server chain composition" via block placement and cache allocation for serving chain-structured jobs with large memory footprints, which models a fundamental problem in serving large foundation models through pipeline parallelism. After showing that finding the optimal solution is NP-hard, the focus is turned to developing scalable algorithms with guaranteed performance under state-of-the-art load balancing. Application of the proposed solution to a distributed large language model (LLM) serving system shows significant reduction of response times compared to state-of-the-art solutions.
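To make the trade-off between block placement and cache allocation concrete, the following toy sketch composes a single server chain for a model split into equal-sized blocks. It is not the paper's algorithm: the `Server` class, the greedy largest-memory-first rule, the assumption of uniform block and cache sizes, and all numbers are hypothetical and chosen only for illustration. Each server must hold its assigned blocks plus per-job cache for every block it hosts, so reserving cache for more concurrent jobs leaves room for fewer blocks per server and lengthens the chain.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str          # hypothetical server identifier
    memory_gb: float   # GPU memory available on this server


def compose_chain(servers, num_blocks, block_gb, cache_gb_per_block, concurrency):
    """Greedily place contiguous blocks on servers, largest memory first.

    Each hosted block costs its weights plus cache for `concurrency` jobs.
    Returns a list of (server name, first block, last block), or None if the
    servers cannot cover all blocks at this concurrency level.
    """
    per_block_gb = block_gb + concurrency * cache_gb_per_block
    chain = []
    next_block = 0
    for server in sorted(servers, key=lambda s: s.memory_gb, reverse=True):
        if next_block >= num_blocks:
            break
        fit = int(server.memory_gb // per_block_gb)  # blocks this server can host
        if fit == 0:
            continue
        end = min(next_block + fit, num_blocks)
        chain.append((server.name, next_block, end - 1))
        next_block = end
    return chain if next_block >= num_blocks else None


def max_concurrency(servers, num_blocks, block_gb, cache_gb_per_block):
    """Largest number of concurrent jobs for which the greedy chain still exists."""
    if cache_gb_per_block <= 0:
        raise ValueError("cache_gb_per_block must be positive")
    level = 0
    while compose_chain(servers, num_blocks, block_gb,
                        cache_gb_per_block, level + 1) is not None:
        level += 1
    return level


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    pool = [Server("A", 40.0), Server("B", 24.0), Server("C", 16.0)]
    print(compose_chain(pool, 8, 4.0, 1.0, 2))
    print(max_concurrency(pool, 8, 4.0, 1.0))
```

With these made-up numbers, one concurrent job fits on a single server, while two or more concurrent jobs force the blocks to be split across servers. The problem described in the abstract is harder than this sketch suggests, since it is NP-hard and is studied together with load balancing; the greedy rule here carries no performance guarantee.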

Problem

Research questions and friction points this paper is trying to address.

server chain composition

large foundation model serving

pipeline parallelism

memory footprint

chain-structured jobs

Innovation

Methods, ideas, or system contributions that make the work stand out.

server chain composition

block placement

cache allocation