TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high I/O overhead and computational bottlenecks encountered when deploying mixture-of-experts (MoE)-based diffusion large language models (dLLMs) on resource-constrained devices. Leveraging, for the first time, the temporal stability of expert activation throughout the diffusion process, the authors propose a lossless, training-free, I/O-aware expert offloading mechanism. This approach employs an interval-based expert refresh strategy within a GPU-CPU cooperative inference architecture and determines the optimal scheduling interval via mathematical programming to dynamically manage expert loading. Evaluated on a single GPU-CPU system, the method achieves up to 1.4× and 1.5× throughput improvements over the baseline on the LLaDA2.0-mini and LLaDA2.0-flash models, respectively.

📝 Abstract

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system for MoE dLLMs. Specifically, we leverage the temporal stability of expert activations during the diffusion process within a block and introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4$\times$ and 1.5$\times$ throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
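To make the interval-based refresh idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The names `simulate` and `best_interval`, the cost constants `load_cost` and `cpu_cost`, and the "keep the most frequently activated experts" placement rule are all assumptions made for illustration. The abstract describes a mathematical programming formulation for choosing the interval; this sketch swaps in a plain enumeration over candidate intervals on a recorded activation trace.

```python
from collections import Counter


def simulate(trace, interval, gpu_slots, load_cost=1.0, cpu_cost=0.25):
    """Replay an activation trace under a fixed refresh interval.

    trace     : list of sets; trace[t] holds the expert ids activated at
                diffusion step t (hypothetical input format).
    interval  : refresh the GPU-resident experts every `interval` steps.
    gpu_slots : number of experts that fit in GPU memory.
    load_cost, cpu_cost : made-up unit costs for one CPU-to-GPU expert
                transfer and for one expert evaluated on the CPU.
    Returns (expert_loads, cpu_hits, total_cost).
    """
    resident = set()    # experts currently placed on the GPU
    window = Counter()  # activation counts since the last refresh
    loads = 0
    cpu_hits = 0
    for step, active in enumerate(trace):
        window.update(active)
        if step % interval == 0:
            # Refresh step: keep the most frequently activated experts
            # seen since the previous refresh (including this step).
            wanted = {e for e, _ in window.most_common(gpu_slots)}
            loads += len(wanted - resident)  # only missing experts move
            resident = wanted
            window.clear()
        # Activated experts that are not on the GPU run on the CPU.
        cpu_hits += len(active - resident)
    return loads, cpu_hits, loads * load_cost + cpu_hits * cpu_cost


def best_interval(trace, gpu_slots, candidates, **costs):
    """Pick the candidate interval with the lowest simulated cost.

    Plain enumeration, standing in for the paper's mathematical program.
    """
    return min(candidates,
               key=lambda k: simulate(trace, k, gpu_slots, **costs)[2])


if __name__ == "__main__":
    # Toy trace: experts {0, 1} for four steps, then {2, 3} for four steps.
    drift = [{0, 1}] * 4 + [{2, 3}] * 4
    for k in (1, 4):
        print(k, simulate(drift, k, gpu_slots=2))
```

In this toy model, a short interval tracks activation changes closely but pays for more transfers, while a long interval saves transfers and pushes more work to the CPU when activations drift. When activations are stable across steps, longer intervals cost nothing extra, which is the property the abstract says TIDE exploits.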

Problem

Research questions and friction points this paper is trying to address.

Diffusion Large Language Models

Mixture-of-Experts

I/O overhead

resource-constrained deployment

inference efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion LLM

mixture-of-experts

I/O-aware offloading