Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address system instability in Mixture-of-Experts (MoE) models caused by dynamic sparse expert invocation under traditional monolithic serving architectures, this paper proposes a decoupled, decentralized MoE serving system. Methodologically, it decomposes the MoE module into stateless expert microservices—enabling fine-grained elasticity and inherent fault tolerance—and replaces centralized scheduling with a lightweight peer-to-peer communication library to eliminate CPU bottlenecks. Furthermore, it introduces a low-overhead dynamic routing scheduler. Experimental results demonstrate that the system achieves throughput comparable to monolithic architectures, incurs less than 2% performance degradation under failure scenarios, and reduces computational resource consumption by 37.5%. Collectively, these contributions significantly enhance the stability, scalability, and resource efficiency of large-scale MoE model deployment.
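EaaS itself is not described at code level in this summary, but the core routing idea (top-k gating over a pool of independent, stateless expert services, with failed experts dropped from routing rather than halting the system) can be illustrated with a minimal Python sketch. All names here (`ExpertRouter`, `make_expert`, `mark_failed`) are hypothetical, not the paper's API:

```python
import math

# Each "expert service" is stateless: its output depends only on the input,
# so any replica can serve any request and a crashed instance can be
# replaced or skipped without state migration.
def make_expert(scale):
    return lambda x: [scale * v for v in x]

class ExpertRouter:
    """Top-k gating over a pool of stateless expert services."""

    def __init__(self, experts, k=2):
        self.experts = experts          # expert_id -> callable
        self.k = k
        self.alive = set(experts)       # experts currently serving

    def mark_failed(self, expert_id):
        # Fault tolerance: a dead expert is simply dropped from routing;
        # its traffic spills over to the next-best alive experts.
        self.alive.discard(expert_id)

    def route(self, token, scores):
        # Select the top-k *alive* experts by gate score and combine
        # their outputs, weighted by softmax over the selected scores.
        ranked = sorted(self.alive, key=lambda e: scores[e], reverse=True)
        chosen = ranked[: self.k]
        exp_scores = [math.exp(scores[e]) for e in chosen]
        z = sum(exp_scores)
        out = [0.0] * len(token)
        for e, s in zip(chosen, exp_scores):
            y = self.experts[e](token)
            out = [o + (s / z) * v for o, v in zip(out, y)]
        return chosen, out

experts = {i: make_expert(scale=i + 1) for i in range(4)}
router = ExpertRouter(experts, k=2)
scores = {0: 0.1, 1: 2.0, 2: 1.5, 3: 0.3}

chosen, _ = router.route([1.0, 1.0], scores)   # top-2 alive experts
router.mark_failed(1)                          # simulate a hardware failure
chosen_after, _ = router.route([1.0, 1.0], scores)
```

Because experts hold no request state, the router can reroute around a failure with only a set update, which is the property the paper attributes to its sub-2% throughput degradation under failures.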

📝 Abstract
Mixture-of-Experts (MoE) models challenge serving infrastructures with dynamic, sparse expert utilization, causing instability on conventional systems designed for dense architectures. We propose EaaS, a novel serving system for efficient, scalable, and robust MoE deployment. Our system disaggregates MoE modules into independent, stateless services. This design enables fine-grained resource scaling and provides inherent fault tolerance by decoupling compute units. The architecture is powered by a high-performance, CPU-free peer-to-peer communication library that ensures minimal overhead and high throughput. Experiments confirm EaaS's efficiency and scalability, achieving throughput comparable to monolithic systems while providing robust fault tolerance. EaaS incurs less than a 2% throughput reduction under simulated hardware failures that would otherwise halt monolithic architectures. It further saves up to 37.5% of computing resources through dynamic, fine-grained adaptation to serving traffic, demonstrating strong resilience for large-scale MoE deployment in production.
Problem

Research questions and friction points this paper is trying to address.

Addressing dynamic, sparse expert utilization challenges in MoE model serving
Overcoming the instability of conventional systems designed for dense architectures
Enabling efficient, scalable, and robust large-scale Mixture-of-Experts deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregates MoE modules into independent stateless services
Uses CPU-free peer-to-peer communication for minimal overhead
Enables dynamic fine-grained resource scaling and fault tolerance