🤖 AI Summary
MoE models impose high GPU memory demands because all experts must remain resident even though only a few are sparsely activated per token, creating severe memory bottlenecks for edge deployment. To address the low cache efficiency and the lack of behavioral modeling in existing offloading strategies, we propose a synergistic offloading mechanism that integrates cache optimization with speculative pre-fetching. Specifically, we design an LFU-variant expert caching policy, leverage gating network outputs for fine-grained expert activation prediction, and introduce speculative expert pre-fetching to hide I/O latency. Through expert activation trajectory modeling, offline simulation, and comparative evaluation across multiple policies, our method improves cache hit rate by 23.6% and reduces peak GPU memory by 31.4% relative to LRU. This work is the first to uncover the dynamic coupling between gating networks and experts, establishing a novel paradigm for MoE interpretability analysis, lightweight pruning, and efficient edge inference.
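The caching policy described above can be illustrated with a minimal sketch. This is a hypothetical LFU-style expert cache written for this summary, not the paper's actual implementation; the class name, the `load_fn` callback, and the tie-breaking behavior of `min` are all illustrative assumptions:

```python
from collections import defaultdict

class LFUExpertCache:
    """Illustrative LFU-style cache for MoE expert weights.

    Hypothetical sketch: the paper's LFU variant and its exact
    eviction tie-breaking rules are not reproduced here.
    """

    def __init__(self, capacity):
        self.capacity = capacity      # max experts resident in GPU memory
        self.cache = {}               # expert_id -> loaded weights
        self.freq = defaultdict(int)  # expert_id -> activation count

    def access(self, expert_id, load_fn):
        """Return True on a hit; on a miss, evict the least-frequently
        activated resident expert and load the requested one."""
        self.freq[expert_id] += 1
        if expert_id in self.cache:
            return True
        if len(self.cache) >= self.capacity:
            victim = min(self.cache, key=lambda e: self.freq[e])
            del self.cache[victim]
        self.cache[expert_id] = load_fn(expert_id)
        return False

# Replaying a synthetic activation trace to measure the hit rate:
cache = LFUExpertCache(capacity=2)
trace = [0, 1, 0, 2, 0, 1]
hits = sum(cache.access(e, lambda eid: f"weights[{eid}]") for e in trace)
print(hits)  # 2 hits out of 6 accesses
```

Replaying recorded activation traces through such a simulator is how the cache-policy comparison described above can be carried out offline, without touching the actual model weights.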
📝 Abstract
Mixture of Experts (MoE) is a crucial architecture used by many of today's most advanced models. A major challenge of MoE models is that they typically require far more memory than their dense counterparts due to their architecture, making them harder to deploy in environments with limited GPU memory, such as edge devices. MoE offloading is a promising technique for overcoming this challenge, especially when enhanced with caching and pre-fetching, but prior work stopped at suboptimal caching algorithms and offered limited insight. In this work, we study MoE offloading in depth and make the following contributions:

1. We analyze expert activation and LRU caching behavior in detail and provide traces.
2. We propose an LFU caching optimization based on our analysis and obtain strong improvements over LRU.
3. We implement and evaluate speculative expert pre-fetching, providing detailed traces that demonstrate its large potential.
4. Our study also extensively covers the behavior of the MoE architecture itself, offering insight into the characteristics of the gating network and experts. This can inspire future work on the interpretation of MoE models and the development of pruning techniques for MoE architectures with minimal performance loss.
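The pre-fetching idea above can be sketched as a small decision rule: use the gating network's scores to predict which experts will be activated, then fetch the predicted experts that are not yet resident before they are needed. The function below is an illustrative assumption, not the paper's implementation; the function name, the `resident` set, and `k` are hypothetical, and a real system would overlap the host-to-GPU copies with ongoing computation:

```python
import numpy as np

def speculative_prefetch(gating_logits, resident, k=2):
    """Return the top-k gate-predicted experts that are not yet resident,
    i.e. the candidates worth fetching ahead of time.

    Hypothetical sketch: gating_logits is the gate's score vector over
    experts, resident is the set of expert ids already in GPU memory.
    """
    top_k = np.argsort(gating_logits)[::-1][:k]  # highest-scoring experts first
    return [int(e) for e in top_k if e not in resident]

# Example: the gate favors experts 1 and 3; expert 1 is already cached,
# so only expert 3 needs to be pre-fetched.
print(speculative_prefetch(np.array([0.1, 2.0, 0.5, 1.5]), resident={1}, k=2))
```

Because the gate's scores are available before the expert computation runs, a prediction like this can be issued early enough for the transfer to hide most of the I/O latency, which is the potential the traces in contribution 3 are meant to quantify.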