Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing

📅 2025-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high memory overhead, significant cold-start latency, and unpredictable expert activation patterns in serverless Mixture-of-Experts (MoE) inference, this paper proposes a heterogeneous hierarchical deployment architecture: the shared backbone (non-expert modules) is deployed on GPUs, while expert modules are dynamically scheduled, based on activation frequency, to either CPUs or isolated serverless functions. Three key techniques are introduced: (1) a Similar Prompts Searching (SPS) algorithm that predicts expert activation from the semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm that guarantees memory availability via worst-case estimation; and (3) a joint memory-replica optimization framework combining Lagrangian duality with the Longest Processing Time (LPT) algorithm. Evaluated across multiple LLM benchmarks, Remoe reduces inference cost by up to 57%, cuts cold-start latency by 47%, and satisfies SLO constraints, outperforming state-of-the-art baselines.

📝 Abstract
Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well-suited for deploying MoEs with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching. These costs are difficult to mitigate via simple model partitioning due to input-dependent expert activation. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1) a Similar Prompts Searching (SPS) algorithm to predict expert activation patterns based on semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm to ensure service-level objectives (SLOs) via worst-case memory estimation; and (3) a joint memory and replica optimization framework leveraging Lagrangian duality and the Longest Processing Time (LPT) algorithm. We implement Remoe on Kubernetes and evaluate it across multiple LLM benchmarks. Experimental results show that Remoe reduces inference cost by up to 57% and cold start latency by 47% compared to state-of-the-art baselines.
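The SPS idea in the abstract, predicting which experts a new prompt will activate by searching for semantically similar past prompts, can be sketched minimally as a nearest-neighbor lookup. This is an illustrative sketch, not the paper's implementation: it assumes prompts are already embedded as vectors, that the activation sets of past prompts are cached, and the function names and threshold are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def predict_experts(query_emb, history, threshold=0.8):
    """Return the cached expert set of the most similar past prompt,
    or None if nothing is similar enough (fall back to the cold path).

    history: list of (embedding, expert_set) pairs for past prompts.
    """
    best_sim, best_experts = 0.0, None
    for emb, experts in history:
        sim = cosine(query_emb, emb)
        if sim > best_sim:
            best_sim, best_experts = sim, experts
    return best_experts if best_sim >= threshold else None
```

A predicted expert set would then let the system pre-warm or pre-load only those experts; a miss (return of `None`) corresponds to the unpredictable case the worst-case MMP reservation is meant to cover.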
Problem

Research questions and friction points this paper is trying to address.

Reduces high inference costs from memory-intensive parameter caching in MoE models
Addresses input-dependent expert activation hindering simple model partitioning
Optimizes memory usage and parallel execution in serverless MoE deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous system assigns experts to CPUs and non-experts to GPUs
Uses semantic similarity to predict expert activation patterns
Optimizes memory and replicas via Lagrangian duality and LPT algorithm
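The third innovation combines Lagrangian duality with LPT scheduling. The LPT half is a classic greedy heuristic (with a known 4/3 − 1/(3m) approximation guarantee for makespan) and can be sketched independently; this minimal Python illustration is an assumption, not the paper's code, and its job format and names are hypothetical:

```python
import heapq

def lpt_schedule(jobs, num_workers):
    """Longest-Processing-Time-first: sort jobs by size descending,
    then always assign the next job to the least-loaded worker.

    jobs: list of (size, job_id) pairs.
    Returns {worker_id: [job_id, ...]}.
    """
    # Min-heap of (current load, worker id) to find the least-loaded worker.
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    for size, job in sorted(jobs, key=lambda j: -j[0]):
        load, w = heapq.heappop(heap)
        assignment[w].append(job)
        heapq.heappush(heap, (load + size, w))
    return assignment
```

In a replica-placement setting, "jobs" would correspond to expert replicas weighted by load, and "workers" to serverless function instances; the Lagrangian dual would set the per-replica prices under the memory budget before LPT balances the placement.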
Wentao Liu
School of Computer Science and Engineering, Southeast University, China
Yuhao Hu
School of Computer Science and Engineering, Southeast University, China
Ruiting Zhou
School of Computer Science and Engineering, Southeast University, China
Baochun Li
Professor of Electrical and Computer Engineering, University of Toronto
Cloud Computing, Distributed Machine Learning, Security, Federated Learning, Multimedia Networking
Ne Wang
Department of Computing, The Hong Kong Polytechnic University, Hong Kong