🤖 AI Summary
To address inefficient resource allocation, high inference latency, and slow cold starts in large language model (LLM) cloud services, this paper introduces DeepFlow, the first serverless AI platform tailored for Ascend NPU clusters. Methodologically, DeepFlow features: (1) a three-tier abstraction model (request–job–task); (2) FlowServe, a microkernel-style, NPU-native serving engine; (3) scheduling policies designed for both PD-disaggregated and PD-colocated deployments; and (4) integrated optimizations including DRAM pre-loading, NPU-fork, and pre-warmed pods. Deployed on large-scale Ascend clusters, DeepFlow has sustained industrial-grade operation for over one year, supporting fine-tuning, agent execution, and model-serving APIs. It achieves elastic scaling to up to 64 instances within seconds, reduces end-to-end latency by 42%, and cuts resource overhead by 37%.
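The three-tier request–job–task abstraction can be pictured as a simple decomposition from a user-facing request down to NPU-pinned units of execution. The sketch below is a minimal illustration in Python; all class and field names (Request, Job, Task, plan_request) are invented here for exposition and are not taken from DeepFlow's codebase.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical names for illustration only; DeepFlow's actual interfaces
# are not specified at this level of detail in the summary above.

@dataclass
class Task:
    """Smallest schedulable unit, pinned to a set of NPUs."""
    task_id: str
    npu_ids: List[int]

@dataclass
class Job:
    """A unit of work, e.g. one fine-tuning run or one serving instance."""
    job_id: str
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Request:
    """What a user submits through the platform API."""
    request_id: str
    kind: str                      # "fine-tune" | "agent" | "serve"
    jobs: List[Job] = field(default_factory=list)

def plan_request(req: Request, npus_per_task: int = 4) -> Request:
    """Toy planner: expand a request into one job with two SPMD tasks."""
    job = Job(job_id=f"{req.request_id}-job0")
    for rank in range(2):
        job.tasks.append(
            Task(
                task_id=f"{job.job_id}-task{rank}",
                npu_ids=list(range(rank * npus_per_task,
                                   (rank + 1) * npus_per_task)),
            )
        )
    req.jobs.append(job)
    return req

if __name__ == "__main__":
    req = plan_request(Request(request_id="r42", kind="serve"))
    for job in req.jobs:
        for task in job.tasks:
            print(task.task_id, "->", task.npu_ids)
```

The point of the hierarchy is that scheduling decisions attach at different tiers: admission and quotas at the request level, placement at the job level, and NPU assignment at the task level.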
📝 Abstract
This paper introduces DeepFlow, a scalable and serverless AI platform designed to efficiently serve large language models (LLMs) at scale in cloud environments. DeepFlow addresses key challenges such as resource allocation, serving efficiency, and cold-start latency through four main design components. First, it uses a simple serverless abstraction called the request-job-task model, which helps manage AI workloads across post-training and model-serving tasks. Second, it builds an in-house serving engine, FlowServe, using a microkernel-inspired design, NPU-centric execution, and SPMD-based parallelism to optimize LLM serving. The system also includes novel scheduling policies tailored for both PD-disaggregated and PD-colocated configurations. With optimizations like pre-warmed pods, DRAM pre-loading, and NPU-fork, DeepFlow can scale up to 64 instances in seconds. DeepFlow has been in production for over a year, operating on a large Ascend NPU cluster and providing industry-standard APIs for fine-tuning, agent serving, and model serving to our customers.
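To make the cold-start optimizations more concrete, here is a minimal sketch of the pre-warmed pod idea: keep a pool of workers whose runtime and device context are already initialized, so scale-out only pays for attaching model state. The WarmPool and Worker classes, and the simulated timings, are hypothetical illustrations of the general pattern, not DeepFlow's actual implementation.

```python
import queue
import time

# Hypothetical sketch of a pre-warmed worker pool; names and timings are
# illustrative only and do not come from DeepFlow's implementation.

class Worker:
    def __init__(self, warm: bool) -> None:
        # A warm worker already has its runtime and device context set up,
        # so only the model weights still need to be attached.
        self.warm = warm

    def attach_model(self, model: str) -> float:
        start = time.perf_counter()
        if not self.warm:
            time.sleep(0.05)   # stand-in for runtime/driver initialization
        time.sleep(0.01)       # stand-in for mapping pre-loaded weights
        return time.perf_counter() - start

class WarmPool:
    def __init__(self, size: int) -> None:
        self._idle: "queue.Queue[Worker]" = queue.Queue()
        for _ in range(size):
            self._idle.put(Worker(warm=True))

    def acquire(self) -> Worker:
        try:
            return self._idle.get_nowait()   # fast path: pre-warmed pod
        except queue.Empty:
            return Worker(warm=False)        # slow path: cold start

if __name__ == "__main__":
    pool = WarmPool(size=2)
    for i in range(3):
        w = pool.acquire()
        print(f"instance {i}: warm={w.warm}, attach={w.attach_model('llm'):.3f}s")
```

DRAM pre-loading and NPU-fork play the complementary role on the slow path, shortening how long the non-warm branch takes when the pool is exhausted.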