In-depth Analysis on Caching and Pre-fetching in Mixture of Experts Offloading

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
MoE models activate only a small subset of experts per token, yet all expert weights must reside in memory, posing severe GPU memory bottlenecks for edge deployment. To address the low cache efficiency and lack of behavioral modeling in existing offloading strategies, we propose a synergistic offloading mechanism integrating cache optimization and speculative pre-fetching. Specifically, we design an LFU-variant expert caching policy, leverage gating network outputs for fine-grained expert activation prediction, and introduce speculative expert pre-fetching to mitigate I/O latency. Through expert activation trajectory modeling, offline simulation, and comparative evaluation across multiple policies, our method achieves a 23.6% improvement in cache hit rate and reduces peak GPU memory by 31.4% over LRU. This work is the first to uncover the dynamic coupling between gating networks and experts, establishing a novel paradigm for MoE interpretability analysis, lightweight pruning, and efficient edge inference.
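The summary mentions an LFU-variant expert caching policy but does not give its details. A minimal sketch of how such a policy could look (class name, interface, and the LRU tie-breaking rule are all assumptions, not the paper's implementation):

```python
from collections import defaultdict

class LFUExpertCache:
    """Illustrative LFU cache for MoE expert weights (hypothetical sketch).

    Evicts the least-frequently-used expert when capacity is reached;
    ties are broken by evicting the least-recently-used entry.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}               # expert_id -> expert weights
        self.freq = defaultdict(int)  # expert_id -> access count
        self.last_used = {}           # expert_id -> logical timestamp
        self.clock = 0                # logical time for LRU tie-breaking

    def get(self, expert_id):
        """Return cached weights, or None on a miss."""
        self.clock += 1
        if expert_id in self.cache:
            self.freq[expert_id] += 1
            self.last_used[expert_id] = self.clock
            return self.cache[expert_id]
        return None

    def put(self, expert_id, weights):
        """Insert expert weights, evicting the LFU (then LRU) entry if full."""
        self.clock += 1
        if expert_id not in self.cache and len(self.cache) >= self.capacity:
            # Victim: lowest access count, then oldest last use.
            victim = min(self.cache,
                         key=lambda e: (self.freq[e], self.last_used[e]))
            del self.cache[victim]
        self.cache[expert_id] = weights
        self.freq[expert_id] += 1
        self.last_used[expert_id] = self.clock
```

Compared with plain LRU, this policy keeps hot experts resident even when many cold experts are touched in between, which matches the skewed expert activation frequencies the summary alludes to.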

📝 Abstract
In today's landscape, Mixture of Experts (MoE) is a crucial architecture used by many of the most advanced models. A major challenge of MoE models is that they usually require much more memory than their dense counterparts due to their unique architecture, and hence are harder to deploy in environments with limited GPU memory, such as edge devices. MoE offloading is a promising technique proposed to overcome this challenge, especially when enhanced with caching and pre-fetching, but prior work stopped at suboptimal caching algorithms and offered limited insights. In this work, we study MoE offloading in depth and make the following contributions: 1. We analyze the expert activation and LRU caching behavior in detail and provide traces. 2. We propose an LFU caching optimization based on our analysis and obtain strong improvements over LRU. 3. We implement and experiment with speculative expert pre-fetching, providing detailed traces showing its huge potential. 4. In addition, our study extensively covers the behavior of the MoE architecture itself, offering insights into the characteristics of the gating network and experts. This can inspire future work on the interpretation of MoE models and the development of pruning techniques for MoE architectures with minimal performance loss.
Problem

Research questions and friction points this paper is trying to address.

Optimizing caching algorithms for memory-efficient MoE offloading in edge devices
Implementing speculative pre-fetching to enhance MoE model performance
Analyzing expert activation patterns to improve MoE architecture efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized LFU caching algorithm for MoE offloading
Implemented speculative expert pre-fetching technique
Provided detailed traces of expert activation patterns
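The speculative pre-fetching contribution uses gating network outputs to predict which experts will be activated and to load them before they are needed. A hedged sketch of the idea (function names, the cache interface, and threading as the overlap mechanism are all assumptions for illustration; the paper's actual system is not shown here):

```python
import threading

def speculative_prefetch(gate_logits, cache, load_expert_fn, top_k=2):
    """Speculatively pre-fetch the experts most likely to be activated.

    gate_logits    : per-expert scores from the gating network (list of floats).
    cache          : dict-like store mapping expert_id -> weights.
    load_expert_fn : hypothetical callable that loads one expert's weights
                     from host memory or disk (I/O-bound, so it runs on a
                     background thread to overlap with computation).
    """
    # Rank experts by gating score and keep the top-k not already cached.
    ranked = sorted(range(len(gate_logits)),
                    key=lambda e: gate_logits[e], reverse=True)
    candidates = [e for e in ranked[:top_k] if e not in cache]

    threads = []
    for expert_id in candidates:
        # Launch the expert load in the background so the I/O latency is
        # hidden behind the current layer's computation.
        t = threading.Thread(
            target=lambda eid=expert_id: cache.__setitem__(
                eid, load_expert_fn(eid)))
        t.start()
        threads.append(t)
    return threads  # caller joins these before the experts are needed
```

If the gating scores are good predictors of the final expert selection, most loads complete before the expert is requested, turning cache misses into hits and hiding the host-to-device transfer latency.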
Shuning Lin
Carnegie Mellon University, Pittsburgh, PA, USA
Yifan He
Carnegie Mellon University, Pittsburgh, PA, USA
Yitong Chen
Fudan University