Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bandwidth bottleneck in Mixture-of-Experts (MoE) diffusion-based large language models, where parallel decoding causes the number of activated experts to scale linearly with decoding parallelism, undermining the efficiency gains of both MoE and parallel decoding. To mitigate this, the authors propose Dynamic Expert Sharing (DES), which replaces conventional per-token expert pruning with sequence-level expert subset selection, enabling efficient expert reuse across an entire decoding block. The approach introduces two key innovations: intra-sequence expert sharing (DES-Seq) and a salience-aware voting mechanism based on aggregated routing weights (DES-Vote). Experiments demonstrate that DES reduces the number of uniquely activated experts by over 55% while preserving 99% of the original accuracy, achieving up to 38% lower inference latency and effectively decoupling memory overhead from parallelism.

📝 Abstract
Among parallel decoding paradigms, diffusion large language models (dLLMs) have emerged as a promising candidate that balances generation quality and throughput. However, their integration with Mixture-of-Experts (MoE) architectures is constrained by an expert explosion: as the number of tokens generated in parallel increases, the number of distinct experts activated grows nearly linearly. This results in substantial memory traffic that pushes inference into a memory-bound regime, negating the efficiency gains of both MoE and parallel decoding. To address this challenge, we propose Dynamic Expert Sharing (DES), a novel technique that shifts MoE optimization from token-centric pruning and conventional expert skipping methods to sequence-level coreset selection. To maximize expert reuse, DES identifies a compact, high-utility set of experts to satisfy the requirements of an entire parallel decoding block. We introduce two innovative selection strategies: (1) Intra-Sequence Sharing (DES-Seq), which adapts optimal allocation to the sequence level, and (2) Saliency-Aware Voting (DES-Vote), a novel mechanism that allows tokens to collectively elect a coreset based on aggregated router weights. Extensive experiments on MoE dLLMs demonstrate that DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from the degree of parallelism.
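To make the Saliency-Aware Voting idea concrete, here is a minimal sketch of how a block of tokens could collectively elect an expert coreset from aggregated router weights. This is an illustration based only on the abstract's description, not the paper's implementation; the function name, the re-routing of tokens to their top-k experts within the coreset, and the weight renormalization are all assumptions.

```python
import numpy as np

def des_vote(router_weights: np.ndarray, coreset_size: int, top_k: int = 2):
    """Hypothetical DES-Vote sketch.

    router_weights: (num_tokens, num_experts) softmax routing scores for
    one parallel decoding block. Each token "votes" for experts with its
    routing weights; the coreset is the top `coreset_size` experts by
    aggregated (summed) weight.
    """
    # Aggregate routing weights across all tokens in the block.
    saliency = router_weights.sum(axis=0)                 # (num_experts,)
    coreset = np.argsort(saliency)[::-1][:coreset_size]   # elected expert ids

    # Assumed re-routing step: each token keeps its top-k experts *within*
    # the coreset, with surviving weights renormalized to sum to 1.
    block_weights = router_weights[:, coreset]            # (tokens, coreset)
    order = np.argsort(block_weights, axis=1)[:, ::-1][:, :top_k]
    picked = np.take_along_axis(block_weights, order, axis=1)
    picked = picked / picked.sum(axis=1, keepdims=True)
    experts = coreset[order]                              # per-token expert ids
    return coreset, experts, picked
```

Under this sketch, only `coreset_size` experts are ever loaded for the whole block, regardless of how many tokens are decoded in parallel, which is the decoupling of memory traffic from parallelism that the abstract describes.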
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
diffusion LLMs
expert explosion
memory-bound inference
parallel decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Expert Sharing
Mixture-of-Experts
diffusion LLMs
parallel decoding
coreset selection