🤖 AI Summary
To address the high memory overhead and poor multi-adapter compatibility of Expert-Specialized Fine-Tuning (ESFT) adapters in large-scale deployment, this paper proposes ExpertWeave, an efficient Mixture-of-Experts (MoE) serving framework. It executes multiple ESFT adapters concurrently atop a single shared base model (unlike LoRA-style additive adapters, which existing multi-adapter systems already support), employs a virtual-memory-assisted expert weight manager for fragmentation-free co-location of base-model and adapter experts, and introduces a fused rerouting kernel for lightweight dynamic redirection of tokens to the appropriate experts. The framework concurrently serves 20 adapters of a 16B MoE model on a single accelerator with low latency, providing up to 94× more KV cache capacity and up to 18% higher throughput than the baseline, at the cost of only a 4–11% latency increase over serving the base model alone. Its core contribution is the first integration of virtual memory management with batched fused rerouting kernels for ESFT serving, significantly improving resource utilization and scalability.
📝 Abstract
Expert-Specialized Fine-Tuning (ESFT) adapts Mixture-of-Experts (MoE) large language models to enhance their task-specific performance by selectively tuning the top-activated experts for the task. Serving these fine-tuned models at scale is challenging: deploying merged models in isolation is prohibitively resource-hungry, while existing multi-adapter serving systems with LoRA-style additive updates are incompatible with ESFT's expert-oriented paradigm. We present ExpertWeave, a system that serves multiple ESFT adapters concurrently over a single shared MoE base model, drastically reducing the memory footprint and improving resource utilization. To seamlessly integrate into existing inference pipelines for MoE models with non-intrusive modifications and minimal latency overhead, ExpertWeave introduces a virtual-memory-assisted expert weight manager that co-locates base-model and adapter experts without incurring memory overhead from fragmentation, and a fused kernel for batched rerouting to enable lightweight redirection of tokens to the appropriate experts at runtime. Our evaluations show that ExpertWeave can simultaneously serve multiple adapters of a 16B MoE model on a single accelerator where the baseline runs out of memory, or provides up to 94x more KV cache capacity and achieves up to 18% higher throughput while using comparable resources, all without compromising model accuracy. ExpertWeave maintains low overhead even when scaling to 20 adapters, with a 4-11% latency increase compared with serving the base model alone. Source code will be released soon.
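The batched rerouting idea can be illustrated with a small sketch. Assume (this layout and all names here are illustrative, not taken from the paper) that the expert weights for a layer live in one shared table: slots `0..E-1` hold the base-model experts, and each adapter's fine-tuned copies occupy additional slots appended after them. Rerouting then reduces to a per-token index remap after the base router picks its top-k experts:

```python
# Conceptual sketch of batched expert rerouting over a shared expert table
# laid out as [base experts | adapter-0 experts | adapter-1 experts | ...].
# NUM_BASE_EXPERTS and the function names are hypothetical.

NUM_BASE_EXPERTS = 8  # per-layer expert count of the base model

def build_remap_table(adapter_expert_slots):
    """adapter_expert_slots: one dict per adapter, mapping a base expert id
    to the slot of that adapter's fine-tuned copy in the shared table.
    Returns a dense lookup: table[adapter_id][expert_id] -> final slot."""
    table = []
    for slots in adapter_expert_slots:
        # Experts the adapter did not fine-tune fall through to the base slot.
        table.append([slots.get(e, e) for e in range(NUM_BASE_EXPERTS)])
    return table

def reroute(token_adapter_ids, topk_expert_ids, remap_table):
    """For each token, redirect its routed experts to the owning adapter's
    fine-tuned copies where they exist; adapter id -1 means a base-model
    request, which keeps the router's output unchanged."""
    out = []
    for a, experts in zip(token_adapter_ids, topk_expert_ids):
        if a < 0:
            out.append(list(experts))
        else:
            out.append([remap_table[a][e] for e in experts])
    return out
```

For example, with adapter 0 fine-tuning experts 1 and 3 (slots 8 and 9) and adapter 1 fine-tuning expert 2 (slot 10), a batch mixing a base request with requests for both adapters is redirected token by token:

```python
remap = build_remap_table([{1: 8, 3: 9}, {2: 10}])
reroute([-1, 0, 1], [[0, 1], [1, 2], [2, 3]], remap)
# -> [[0, 1], [8, 2], [10, 3]]
```

In the real system this remap would run as a fused device kernel over the whole batch rather than a Python loop, but the index arithmetic is the same.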