🤖 AI Summary
To address high model-loading latency and pipeline stalls during cold starts in serverless deep learning inference, this paper proposes Cicada, a pipeline-efficient optimization framework tailored for serverless inference. It introduces a novel three-tiered coordination mechanism: (1) MiniLoader—a lightweight parameter initialization module; (2) WeightDecoupler—enabling asynchronous, out-of-order weight loading and application; and (3) Priority-Aware Scheduler—dynamically prioritizing layer-level execution. Collectively, these components decouple model loading from computation and enable fine-grained compute-storage co-scheduling at the layer level. Experimental results demonstrate that, compared to PISeL, Cicada reduces end-to-end inference latency by 61.59% on average and improves pipeline utilization by 2.52×. Ablation studies show that MiniLoader accounts for 53.41% of the latency reduction, while WeightDecoupler alone accelerates inference by 26.17%.
📝 Abstract
Serverless computing has emerged as a pivotal paradigm for deploying Deep Learning (DL) models, offering automatic scaling and cost efficiency. However, the inherent cold-start problem in serverless ML inference systems, particularly the time-consuming model loading process, remains a significant bottleneck. Pipelined model loading improves efficiency but still suffers from pipeline stalls caused by sequential layer construction and monolithic weight loading. In this paper, we propose *Cicada*, a novel pipeline optimization framework that coordinates computational, storage, and scheduling resources through three key mechanisms: (1) *MiniLoader*, which reduces layer construction overhead by opportunistically optimizing parameter initialization; (2) *WeightDecoupler*, which decouples weight-file processing from layer construction, enabling asynchronous weight retrieval and out-of-order weight application; and (3) *Priority-Aware Scheduler*, which dynamically allocates resources to ensure that high-priority inference tasks execute promptly. Our experimental results demonstrate that Cicada achieves significant performance improvements over the state-of-the-art PISeL framework. Specifically, Cicada reduces end-to-end inference latency by an average of 61.59%, with the MiniLoader component contributing the majority of this optimization (53.41%) and the WeightDecoupler achieving up to a 26.17% improvement. Additionally, Cicada achieves up to a 2.52× speedup in inference pipeline utilization compared to PISeL.
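The core decoupling idea behind WeightDecoupler can be illustrated with a minimal sketch (all names and the threading design here are hypothetical illustrations, not the paper's implementation): a background thread retrieves weight tensors asynchronously while layer construction proceeds in parallel, and each weight is applied as soon as both the tensor and its layer are available, regardless of the order in which weights arrive.

```python
import threading
import queue
import time

# Hypothetical sketch of decoupled, out-of-order weight application.
# Layer construction and weight retrieval run concurrently; a weight is
# applied as soon as BOTH its layer is built and its tensor has arrived,
# independent of the order in which weights are fetched from storage.

LAYERS = ["conv1", "conv2", "fc"]

built_layers = {}        # layer name -> constructed layer object
pending_weights = {}     # weights that arrived before their layer was built
applied = []             # record of weights actually applied
lock = threading.Lock()
weight_q = queue.Queue() # async channel from the fetcher thread

def fetch_weights():
    """Simulate asynchronous weight retrieval arriving out of order."""
    for name in ["fc", "conv1", "conv2"]:   # storage returns out of order
        time.sleep(0.01)                    # simulated storage latency
        weight_q.put((name, f"tensor<{name}>"))
    weight_q.put(None)                      # sentinel: all weights fetched

def build_layers():
    """Construct layers sequentially; apply any weight that already arrived."""
    for name in LAYERS:
        time.sleep(0.015)                   # simulated construction cost
        with lock:
            built_layers[name] = object()
            if name in pending_weights:     # weight beat the layer: apply now
                pending_weights.pop(name)
                applied.append(name)

def apply_weights():
    """Drain the queue, applying weights out of order when possible."""
    while True:
        item = weight_q.get()
        if item is None:
            return
        name, tensor = item
        with lock:
            if name in built_layers:
                applied.append(name)            # layer ready: apply now
            else:
                pending_weights[name] = tensor  # buffer until layer exists

fetcher = threading.Thread(target=fetch_weights)
builder = threading.Thread(target=build_layers)
fetcher.start(); builder.start()
apply_weights()
fetcher.join(); builder.join()
print(sorted(applied))
```

Because application order is decoupled from retrieval order, a slow-arriving weight for an early layer no longer stalls the rest of the pipeline; under the paper's design this is what removes the monolithic weight-loading bottleneck.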