SiftMoE: Similarity-Aware Energy-Efficient Expert Selection for Wireless Distributed MoE Inference

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of energy-efficient inference in wireless distributed Mixture-of-Experts (MoE) systems, where the number of experts often exceeds the memory capacity of a single node, necessitating cross-node deployment and making expert selection critical to communication overhead and energy consumption. To this end, the authors propose SiftMoE, which establishes what they describe as the first theoretical bound on how skipping or replacing experts affects model accuracy. Leveraging this bound, they design an energy-optimal expert selection strategy that jointly satisfies latency and accuracy constraints. The approach integrates information-theoretic bounds, convex optimization, and wireless channel modeling, and supports both single-token decoding and multi-token prefilling under slow- and fast-fading channel conditions. Experiments demonstrate that SiftMoE significantly reduces system energy consumption while preserving inference accuracy, outperforming conventional Top-K routing schemes.
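The constrained energy-minimization problem the summary describes can be sketched in a generic form. All symbols below are illustrative, not the paper's actual notation: binary selection variables $x_i$, per-expert energy cost $E_i$, per-expert latency $\tau_i$, and an accuracy-loss term $\Delta_i$ standing in for the paper's skipping/replacement bound.

```latex
\min_{x \in \{0,1\}^N} \; \sum_{i=1}^{N} E_i x_i
\quad \text{s.t.} \quad
\sum_{i=1}^{N} \tau_i x_i \le T_{\max},
\qquad
\sum_{i=1}^{N} \Delta_i (1 - x_i) \le \epsilon
```

Here $x_i = 1$ means expert $i$ is selected (and transmitted to if remote), while $x_i = 0$ means it is skipped or replaced, contributing $\Delta_i$ to the accuracy-loss budget $\epsilon$; the latency budget $T_{\max}$ captures the paper's latency constraint.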

📝 Abstract
Mixture-of-Experts (MoE) architectures leverage sparse activation to enhance the scalability of large language models (LLMs), making them suitable for deployment in resource-constrained edge networks. However, the sheer number of experts often exceeds the memory capacity of individual edge nodes, necessitating wireless distributed MoE (WIDE) inference where experts are spread across multiple edge nodes. In this context, expert selection directly affects communication costs. Motivated by the similarity of experts, we propose SiftMoE, which judiciously selects or skips experts to balance communication costs against inference accuracy. Specifically, we first establish theoretical bounds on the accuracy degradation resulting from expert replacement or skipping. Based on these bounds, we formulate an energy minimization problem for expert selection in WIDE inference subject to latency and accuracy constraints. In particular, for slow-fading channels, we derive optimal expert selection policies for both single-token decoding and multi-token prefilling. For fast-fading channels, we further extend our scheme to cope with rapidly varying channel conditions. Simulation results demonstrate that SiftMoE significantly reduces energy consumption while maintaining inference accuracy compared with conventional Top-K routing in WIDE systems.
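The abstract's core idea, replacing or skipping remote experts in favor of similar local ones, can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's algorithm: expert "signatures" stand in for whatever representation SiftMoE uses to measure expert similarity, `local_experts` marks experts resident on the current edge node, and `SIM_THRESHOLD` is an arbitrary cutoff rather than a value derived from the paper's accuracy bound.

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts, dim, top_k = 8, 16, 2
# Hypothetical per-expert signatures used to measure pairwise similarity;
# the paper's actual similarity metric may differ.
expert_sigs = rng.standard_normal((num_experts, dim))
local_experts = {0, 1, 2}   # experts assumed resident on this edge node
SIM_THRESHOLD = 0.0         # illustrative cutoff, not from the paper


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def sift_select(router_logits):
    """Top-K routing, then replace a remote expert with its most similar
    local expert when similarity clears the threshold; otherwise keep the
    remote expert and pay the wireless communication cost."""
    top = np.argsort(router_logits)[::-1][:top_k]
    selection = []
    for e in top:
        if e in local_experts:
            selection.append(int(e))
            continue
        best = max(local_experts,
                   key=lambda l: cosine(expert_sigs[e], expert_sigs[l]))
        if cosine(expert_sigs[e], expert_sigs[best]) > SIM_THRESHOLD:
            selection.append(int(best))  # replace: avoid remote transfer
        else:
            selection.append(int(e))     # keep remote expert
    return selection


logits = rng.standard_normal(num_experts)
selected = sift_select(logits)
print(selected)
```

In a real system the replacement decision would also be weighed against the accuracy-loss bound and per-link channel state, which this sketch omits entirely.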
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
wireless distributed inference
expert selection
energy efficiency
edge computing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
expert selection
energy efficiency
wireless distributed inference
similarity-aware