Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement

📅 2025-08-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory footprint, substantial inter-node communication overhead, and poor adaptability to heterogeneous resources of distributed Mixture-of-Experts (MoE) inference at the edge, this paper proposes DanceMoE, a lightweight, adaptive framework. Methodologically, it introduces (1) an activation-aware expert placement algorithm that jointly leverages sparse activation patterns and workload-locality analysis for data-driven, dynamic expert allocation across edge nodes; and (2) a lightweight cross-server expert migration mechanism that enables low-latency collaborative inference. Evaluated on a real-world edge cluster, DanceMoE reduces end-to-end inference latency by up to 30.6% over state-of-the-art baselines while significantly cutting inter-node communication volume. The framework achieves a favorable trade-off among inference efficiency, scalability, and adaptability to stringent edge resource constraints.
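The paper does not spell out the placement algorithm here, but the core idea of "activation-aware placement under per-server memory budgets" can be sketched as a greedy heuristic: assign each expert to the server where it fires most often, subject to memory headroom. Everything below (the greedy ordering, function names, and the fallback rule) is an illustrative assumption, not DanceMoE's actual algorithm:

```python
from typing import Dict, List

def place_experts(
    activation_freq: Dict[int, List[float]],  # expert_id -> per-server activation frequency
    expert_mem: Dict[int, float],             # expert_id -> memory footprint (GB)
    server_capacity: List[float],             # per-server memory budget (GB)
) -> Dict[int, int]:
    """Greedily assign each expert to the server where it is activated most
    often and that still has memory headroom, falling back to the server
    with the most remaining capacity."""
    remaining = list(server_capacity)
    placement: Dict[int, int] = {}
    # Place the "hottest" experts (largest total activation) first so they
    # claim their preferred servers before memory runs out.
    order = sorted(activation_freq, key=lambda e: -sum(activation_freq[e]))
    for e in order:
        # Servers ranked by how often this expert fires there (workload locality).
        prefs = sorted(range(len(remaining)),
                       key=lambda s: -activation_freq[e][s])
        target = next((s for s in prefs if remaining[s] >= expert_mem[e]),
                      max(range(len(remaining)), key=lambda s: remaining[s]))
        placement[e] = target
        remaining[target] -= expert_mem[e]
    return placement
```

A greedy pass like this captures the stated trade-off between local coverage (serving activations on the server where they occur) and balanced memory usage, though the paper's data-driven algorithm may solve it differently.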

πŸ“ Abstract
Mixture-of-Experts (MoE) models have become a cornerstone for training and scaling large language models (LLMs), offering substantial gains in model capacity and efficiency through sparse expert activation. However, serving these models remains challenging in practice, particularly in resource-constrained edge environments, due to their large memory footprint and complex communication demands. While centralized cloud inference is common, it incurs high infrastructure costs, along with latency and privacy concerns. A few recent edge MoE works propose memory-efficient strategies but typically focus on single-device or homogeneous setups. This paper presents DanceMoE, an efficient MoE inference framework that enables activation-aware expert placement across collaborative, heterogeneous, GPU-equipped edge servers. DanceMoE leverages the inherent sparsity of MoE models and workload locality to minimize cross-server communication and enable efficient expert placement under heterogeneous resource constraints. It introduces a data-driven, activation-aware placement algorithm that balances local coverage and memory usage across servers, alongside a lightweight migration mechanism that adapts expert assignments under evolving workloads. We evaluate DanceMoE on modern MoE models and widely used datasets, demonstrating up to 30.6% lower inference latency and substantial communication reduction compared to state-of-the-art baselines, showcasing the effectiveness of collaborative edge-based MoE inference.
Problem

Research questions and friction points this paper is trying to address.

Optimizing expert placement for low-latency edge MoE inference
Reducing cross-server communication in distributed edge environments
Balancing resource constraints and workload demands efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latency-optimized expert placement algorithm
Activation-aware cross-server communication minimization
Lightweight migration mechanism for workload adaptation
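The migration idea in the last bullet can be approximated by a simple cost/benefit rule: move an expert only when the projected communication savings over some horizon outweigh the one-time cost of shipping its weights. The cost model and parameter names below are illustrative assumptions, not the paper's actual policy:

```python
def should_migrate(
    freq_local: float,        # fraction of recent activations served on the current server
    freq_best: float,         # fraction on the best candidate server
    expert_size_bytes: float, # one-time cost to ship the expert's weights
    token_bytes: float,       # per-token hidden-state transfer size
    horizon_tokens: int,      # tokens expected before the next re-evaluation
) -> bool:
    """Migrate only if the communication saved over the horizon exceeds
    the one-time cost of moving the expert's weights."""
    saved_transfers = (freq_best - freq_local) * horizon_tokens
    return saved_transfers * token_bytes > expert_size_bytes
```

Under this toy model, a heavily mis-placed expert (e.g. 80% of its activations landing remotely) quickly amortizes the migration cost, while a marginal imbalance does not, which matches the framework's stated goal of adapting placements to evolving workloads without thrashing.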