QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

📅 2025-05-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory overhead and service-level interference of deploying multiple fine-tuned Mixture-of-Experts (MoE) large language models on a single GPU under multi-tenancy, this paper proposes a similarity-driven expert merging and runtime partial reconfiguration mechanism. It clusters semantically similar experts across models to enable cross-model expert sharing, reducing the GPU memory footprint, while dynamically swapping non-shared layers via partial runtime reconfiguration to preserve output quality. The authors further design a lightweight MoE model scheduler and a single-GPU multi-instance serving architecture. On a single A100 GPU with Mixtral-8x7B models, the approach reduces average job turnaround time by 85% compared to NVIDIA's Multi-Instance GPU (MIG), incurs negligible degradation in time-to-first-token (TTFT), and achieves throughput comparable to single-model serving. Evaluation on Switch Transformer Base-8 variants with up to four fine-tuned models further demonstrates significant quality improvements over model merging baselines.
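The expert-consolidation step above can be illustrated with a minimal sketch. The paper does not specify the exact similarity metric or merge rule, so the following assumes cosine similarity over flattened expert weights and simple averaging of matched pairs; `consolidate_experts` and its threshold are hypothetical names for illustration only.

```python
import numpy as np

def consolidate_experts(experts_a, experts_b, threshold=0.95):
    """Greedily pair experts from two fine-tuned variants whose flattened
    weights exceed a cosine-similarity threshold; each matched pair is
    replaced by its average, so both models can share one resident copy.

    Hypothetical sketch of similarity-based expert consolidation; the
    paper's actual clustering and merge rule may differ.
    """
    shared = []                              # merged experts shared by both models
    only_a = list(range(len(experts_a)))     # indices kept private to model A
    only_b = list(range(len(experts_b)))     # indices kept private to model B
    for i, wa in enumerate(experts_a):
        best_j, best_sim = None, threshold
        for j in only_b:
            wb = experts_b[j]
            sim = float(wa @ wb / (np.linalg.norm(wa) * np.linalg.norm(wb)))
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            shared.append((wa + experts_b[best_j]) / 2)  # one copy serves both
            only_a.remove(i)
            only_b.remove(best_j)
    return shared, only_a, only_b
```

Each shared expert removes one duplicate from GPU memory, which is where the footprint reduction comes from; the threshold trades memory savings against output quality.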

📝 Abstract
The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs similarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce runtime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality and throughput comparable to serving a single model, while incurring a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA's Multi-Instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.
Problem

Research questions and friction points this paper is trying to address.

Efficiently serving multiple MoE-LLMs on a single GPU
Reducing memory footprint via similarity-based expert consolidation
Maintaining output quality with runtime partial reconfiguration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Similarity-based expert consolidation reduces memory
Runtime partial reconfiguration replaces non-expert layers
Maintains throughput comparable to single model serving
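The runtime partial reconfiguration idea in the bullets above can be sketched as a tiny serving loop: shared experts stay resident on the GPU, and only the much smaller model-specific (non-expert) layers are swapped in when consecutive requests target different fine-tuned variants. The class and field names below are hypothetical, and the inference call is a placeholder; this is a sketch of the control flow, not the paper's implementation.

```python
class PartialReconfigServer:
    """Minimal sketch of runtime partial reconfiguration (names hypothetical).

    Shared experts remain resident for all models; a request for a
    different model triggers a swap of only its non-expert layers.
    """

    def __init__(self, shared_experts, per_model_layers):
        self.shared_experts = shared_experts      # resident on GPU for all models
        self.per_model_layers = per_model_layers  # per-model non-expert layers
        self.active_model = None                  # which variant is configured
        self.active_layers = None
        self.swaps = 0                            # count of reconfigurations

    def serve(self, model_id, request):
        if model_id != self.active_model:
            # Partial reconfiguration: swap only the non-expert layers,
            # leaving the consolidated experts untouched.
            self.active_layers = self.per_model_layers[model_id]
            self.active_model = model_id
            self.swaps += 1
        return f"{model_id}:{request}"  # placeholder for actual inference
```

Because only the non-expert layers move, each swap is far cheaper than reloading a whole model, which is consistent with the reported negligible TTFT impact.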
HamidReza Imani
Department of Electrical and Computer Engineering, The George Washington University
Jiaxin Peng
Department of Electrical and Computer Engineering, The George Washington University
Peiman Mohseni
Computer Science and Engineering Department, Texas A&M University
Abdolah Amirany
University of Florida
VLSI Design, Emerging Technologies, Neuromorphic Design and Computing, Approximate Computing
Tarek El-Ghazawi
The George Washington University (GWU)
High-Performance Computing, Computer Architecture, Parallel Programming, Remote Sensing