🤖 AI Summary
To address severe request volatility and acute GPU resource fragmentation in serverless LLM serving, this paper proposes a dynamic pipeline reconfiguration architecture. The method introduces three core innovations: (1) fine-grained model partitioning that preserves computational graph constraints; (2) runtime adaptive pipeline reconfiguration with cache coherence maintenance; and (3) topology- and fragmentation-aware GPU resource scheduling. The system integrates real-time request pattern analysis, coherent cache migration, and a topology optimization algorithm for deeply segmented pipelines. Evaluated on an 82-GPU cluster, it achieves up to 8.5× higher resource efficiency, reduces end-to-end latency by 38.3%, and lowers the GPU reservation requirement from 75% to 30% of peak capacity. These results demonstrate substantial gains in service performance and resource utilization under highly dynamic workloads.
📝 Abstract
Serving Large Language Models (LLMs) in production faces significant challenges from highly variable request patterns and severe resource fragmentation in serverless clusters. Current systems rely on static pipeline configurations that struggle to adapt to dynamic workload conditions, leading to substantial inefficiencies. We present FlexPipe, a novel system that dynamically reconfigures pipeline architectures during runtime to address these fundamental limitations. FlexPipe decomposes models into fine-grained stages and intelligently adjusts pipeline granularity based on real-time request pattern analysis, implementing three key innovations: fine-grained model partitioning with preserved computational graph constraints, in-flight pipeline refactoring with consistent cache transitions, and topology-aware resource allocation that navigates GPU fragmentation. Comprehensive evaluation on an 82-GPU cluster demonstrates that FlexPipe achieves up to 8.5× better resource efficiency while maintaining 38.3% lower latency compared to state-of-the-art systems, reducing GPU reservation requirements from 75% to 30% of peak capacity.
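To make the core idea concrete, the sketch below illustrates one way a system could pick a pipeline depth from observed load and free GPUs. This is a hypothetical toy model, not FlexPipe's actual algorithm: the names `ClusterState`, `stage_capacity`, and `choose_pipeline_depth` are invented for illustration, and the real system additionally accounts for computational graph constraints, cache migration, and GPU topology.

```python
# Hypothetical sketch of adaptive pipeline-granularity selection (illustrative
# only; not the paper's algorithm). Finer granularity (more, smaller stages)
# lets replicas fit into fragmented GPU free space; coarser granularity cuts
# inter-stage communication overhead when load is low.

from dataclasses import dataclass


@dataclass
class ClusterState:
    free_gpus: int           # GPUs currently unallocated (fragmented or not)
    requests_per_sec: float  # observed request arrival rate
    stage_capacity: float    # assumed requests/sec one pipeline stage sustains


def choose_pipeline_depth(state: ClusterState, model_layers: int) -> int:
    """Return a pipeline depth (stage count) between 1 and model_layers.

    The depth is sized to the arrival rate, then clamped so we never
    split finer than one layer per stage and never request more stages
    than there are free GPUs to host them.
    """
    # Stages needed to sustain the current arrival rate.
    needed = max(1, round(state.requests_per_sec / state.stage_capacity))
    return max(1, min(needed, model_layers, state.free_gpus))


if __name__ == "__main__":
    # Moderate load, plenty of free GPUs: depth follows demand.
    print(choose_pipeline_depth(ClusterState(8, 40.0, 10.0), 32))   # 4
    # Heavy load but only 2 free GPUs: fragmentation caps the depth.
    print(choose_pipeline_depth(ClusterState(2, 100.0, 10.0), 32))  # 2
    # Near-idle: collapse to a single coarse stage.
    print(choose_pipeline_depth(ClusterState(8, 1.0, 10.0), 32))    # 1
```

The clamp against `free_gpus` is where a real scheduler would instead consult the cluster topology, preferring contiguous or well-connected GPUs over an arbitrary count.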