Staleness-Centric Optimizations for Parallel Diffusion MoE Inference

📅 2024-11-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address activation staleness—caused by communication latency—in expert-parallel diffusion Mixture-of-Experts (MoE) inference, this paper proposes DICE, a novel optimization framework centered on staleness modeling. DICE integrates three synergistic strategies: (1) interleaved pipelined scheduling, (2) layer-granularity selective synchronization gating, and (3) token-importance-driven dynamic communication pruning—requiring no additional training. It is the first method to achieve step-level staleness reduction by 50% and token-level fine-grained communication control. Evaluated on diffusion MoE models, DICE maintains near-identical FID and CLIP Score while delivering 1.26× end-to-end inference speedup—significantly outperforming state-of-the-art displaced parallelism approaches. The implementation is publicly available.

Technology Category

Application Category

📝 Abstract
Mixture-of-Experts-based (MoE-based) diffusion models demonstrate remarkable scalability in high-fidelity image generation, yet their reliance on expert parallelism introduces critical communication bottlenecks. State-of-the-art methods alleviate such overhead in parallel diffusion inference through computation-communication overlapping, termed displaced parallelism. However, we identify that these techniques induce severe *staleness*-the usage of outdated activations from previous timesteps that significantly degrades quality, especially in expert-parallel scenarios. We tackle this fundamental tension and propose DICE, a staleness-centric optimization framework with a three-fold approach: (1) Interweaved Parallelism introduces staggered pipelines, effectively halving step-level staleness for free; (2) Selective Synchronization operates at layer-level and protects layers vulnerable from staled activations; and (3) Conditional Communication, a token-level, training-free method that dynamically adjusts communication frequency based on token importance. Together, these strategies effectively reduce staleness, achieving 1.26x speedup with minimal quality degradation. Empirical results establish DICE as an effective and scalable solution. Our code is publicly available at https://anonymous.4open.science/r/DICE-FF04
Problem

Research questions and friction points this paper is trying to address.

Reduces communication bottlenecks in MoE-based diffusion models
Addresses staleness from outdated activations in parallel inference
Optimizes trade-off between speed and quality in expert-parallel scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interweaved Parallelism reduces step-level staleness
Selective Synchronization protects vulnerable layers
Conditional Communication adjusts token importance dynamically
🔎 Similar Papers
No similar papers found.
J
Jiajun Luo
Tsinghua University
L
Lizhuo Luo
SUSTech
J
Jianru Xu
SUSTech
Jiajun Song
Jiajun Song
Michigan technological University
Wave Energy Converter
Rongwei Lu
Rongwei Lu
Tsinghua University
Distributed machine learninggradient compressionfederated learning
C
Chen Tang
The Chinese University of Hong Kong
Z
Zhi Wang
Tsinghua University