🤖 AI Summary
To address high memory overhead and service-level interference when deploying multiple fine-tuned Mixture-of-Experts (MoE) large language models on a single GPU under multi-tenancy, this paper proposes a similarity-driven expert merging and runtime partial reconfiguration mechanism. It clusters semantically similar experts across models to enable cross-model expert sharing, reducing the GPU memory footprint, while dynamically swapping non-expert layers via runtime partial reconfiguration to preserve output quality. The authors further design a lightweight MoE model scheduler and a single-GPU multi-instance serving architecture. On a single NVIDIA A100 GPU with Mixtral-8x7B models, the approach reduces average job turnaround time by 85% compared to NVIDIA's Multi-Instance GPU (MIG), incurs negligible degradation in time-to-first-token (TTFT), and achieves throughput comparable to single-model serving. Evaluation on Google's Switch Transformer Base-8 variants further demonstrates improved output quality over model merging baselines.
📝 Abstract
The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs *similarity-based expert consolidation* to reduce the overall memory footprint by sharing similar experts across models. To preserve output quality, we introduce *runtime partial reconfiguration*, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality with throughput comparable to serving a single model, while incurring only a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA's Multi-Instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines.
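To make the consolidation idea concrete, here is a minimal sketch (not the paper's actual algorithm) of similarity-based expert consolidation between two fine-tuned variants: experts are compared by cosine similarity of their flattened weights, pairs above a threshold are merged into a single shared expert, and the rest stay model-private. The `0.95` threshold and the simple weight-averaging merge are illustrative assumptions, not values from the paper.

```python
import numpy as np

def expert_similarity(w_a, w_b):
    """Cosine similarity between two experts' flattened weight tensors."""
    a, b = w_a.ravel(), w_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def consolidate_experts(experts_a, experts_b, threshold=0.95):
    """Greedily pair experts across two models whose similarity exceeds
    `threshold`, replacing each pair with one averaged shared expert.
    Returns (shared, a_private, b_private); only `shared` is stored once,
    which is where the memory saving comes from."""
    shared, a_private = [], []
    unmatched_b = list(range(len(experts_b)))
    for wa in experts_a:
        best_j, best_sim = None, threshold
        for j in unmatched_b:
            sim = expert_similarity(wa, experts_b[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            # Merge by averaging (an illustrative choice of merge operator).
            shared.append((wa + experts_b[best_j]) / 2)
            unmatched_b.remove(best_j)
        else:
            a_private.append(wa)  # no close match: model A keeps its copy
    b_private = [experts_b[j] for j in unmatched_b]
    return shared, a_private, b_private
```

In a real MoE serving system the comparison would run per layer over the experts' FFN weights, and the router tables of both models would be remapped to point at the shared pool; this sketch only shows the pairing-and-merging step.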