QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

📅 2025-05-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory overhead and service-level interference of deploying multiple fine-tuned Mixture-of-Experts (MoE) large language models on a single GPU under multi-tenancy, this paper proposes a similarity-driven expert merging and runtime partial reconfiguration mechanism. It clusters semantically similar experts across models to enable cross-model expert sharing, reducing the GPU memory footprint, while dynamically swapping non-shared layers via partial runtime reconfiguration to preserve output quality. The authors further design a lightweight MoE model scheduler and a single-GPU multi-instance serving architecture. On a single A100 GPU with Mixtral-8x7B models, the approach reduces average job turnaround time by 85% compared to NVIDIA's Multi-Instance GPU (MIG), incurs negligible degradation in time-to-first-token (TTFT), and achieves throughput comparable to single-model serving. Evaluation on Switch Transformer Base-8 variants with up to four fine-tuned models further demonstrates significant quality improvements over model merging baselines.
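The expert-consolidation step above can be illustrated with a minimal sketch. The paper does not specify the exact similarity metric or merge rule, so the following assumes cosine similarity over flattened expert weights and simple averaging of matched pairs; `consolidate_experts` and its threshold are hypothetical names for illustration only.

```python
import numpy as np

def consolidate_experts(experts_a, experts_b, threshold=0.95):
    """Greedily pair experts from two fine-tuned variants whose flattened
    weights exceed a cosine-similarity threshold; each matched pair is
    replaced by its average, so both models can share one resident copy.

    Hypothetical sketch of similarity-based expert consolidation; the
    paper's actual clustering and merge rule may differ.
    """
    shared = []                              # merged experts shared by both models
    only_a = list(range(len(experts_a)))     # indices kept private to model A
    only_b = list(range(len(experts_b)))     # indices kept private to model B
    for i, wa in enumerate(experts_a):
        best_j, best_sim = None, threshold
        for j in only_b:
            wb = experts_b[j]
            sim = float(wa @ wb / (np.linalg.norm(wa) * np.linalg.norm(wb)))
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            shared.append((wa + experts_b[best_j]) / 2)  # one copy serves both
            only_a.remove(i)
            only_b.remove(best_j)
    return shared, only_a, only_b
```

Each shared expert removes one duplicate from GPU memory, which is where the footprint reduction comes from; the threshold trades memory savings against output quality.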

📝 Abstract
The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality and throughput comparable to serving a single model, while incurring a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA's Multi-Instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.
Problem

Research questions and friction points this paper is trying to address.

Efficiently serving multiple MoE-LLMs on a single GPU
Reducing memory footprint via similarity-based expert consolidation
Maintaining output quality with runtime partial reconfiguration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Similarity-based expert consolidation reduces memory
Runtime partial reconfiguration replaces non-expert layers
Maintains throughput comparable to single model serving
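The runtime partial reconfiguration idea in the bullets above can be sketched as a tiny serving loop: shared experts stay resident on the GPU, and only the much smaller model-specific (non-expert) layers are swapped in when consecutive requests target different fine-tuned variants. The class and field names below are hypothetical, and the inference call is a placeholder; this is a sketch of the control flow, not the paper's implementation.

```python
class PartialReconfigServer:
    """Minimal sketch of runtime partial reconfiguration (names hypothetical).

    Shared experts remain resident for all models; a request for a
    different model triggers a swap of only its non-expert layers.
    """

    def __init__(self, shared_experts, per_model_layers):
        self.shared_experts = shared_experts      # resident on GPU for all models
        self.per_model_layers = per_model_layers  # per-model non-expert layers
        self.active_model = None                  # which variant is configured
        self.active_layers = None
        self.swaps = 0                            # count of reconfigurations

    def serve(self, model_id, request):
        if model_id != self.active_model:
            # Partial reconfiguration: swap only the non-expert layers,
            # leaving the consolidated experts untouched.
            self.active_layers = self.per_model_layers[model_id]
            self.active_model = model_id
            self.swaps += 1
        return f"{model_id}:{request}"  # placeholder for actual inference
```

Because only the non-expert layers move, each swap is far cheaper than reloading a whole model, which is consistent with the reported negligible TTFT impact.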
HamidReza Imani
Department of Electrical and Computer Engineering, The George Washington University
Jiaxin Peng
Department of Electrical and Computer Engineering, The George Washington University
Peiman Mohseni
Computer Science and Engineering Department, Texas A&M University
Abdolah Amirany
University of Florida
VLSI Design, Emerging Technologies, Neuromorphic Design and Computing, Approximate Computing
Tarek El-Ghazawi
The George Washington University (GWU)
High-Performance Computing, Computer Architecture, Parallel Programming, Remote Sensing