🤖 AI Summary
This work addresses the memory and energy bottlenecks of large-scale sparse Mixture-of-Experts (MoE) models during inference. While analog in-memory computing (AIMC) can reduce data-movement costs, its practical deployment is hindered by hardware nonidealities and the infeasibility of noise-aware retraining for massive MoE architectures. To overcome these challenges, we propose the first retraining-free digital-analog heterogeneous computing framework. It identifies noise-sensitive experts using a provable maximum neuron norm criterion, and it accounts for module activation density: densely activated components such as attention layers are allocated to digital computation, while the remaining experts are mapped onto analog hardware. Evaluated on models such as DeepSeekMoE and OLMoE, our approach significantly enhances inference robustness, preserves accuracy across diverse benchmarks, and simultaneously achieves high energy efficiency and strong generalization.
📝 Abstract
Sparse Mixture-of-Experts (MoE) models enable efficient scalability by activating only a small subset of experts per input, yet their massive parameter counts lead to substantial memory and energy inefficiency during inference. Analog in-memory computing (AIMC) offers a promising solution by eliminating frequent data movement between memory and compute units. However, mitigating the hardware nonidealities of AIMC typically requires noise-aware retraining, which is infeasible for large MoE models. In this paper, we propose a retraining-free heterogeneous computation framework in which noise-sensitive experts, which are provably identifiable by their maximum neuron norm, are computed digitally while the majority of the experts are executed on AIMC hardware. We further assign densely activated modules, such as attention layers, to digital computation due to their high noise sensitivity, despite their comprising only a small fraction of parameters. Extensive experiments on large MoE language models, including DeepSeekMoE and OLMoE, across multiple benchmark tasks validate the robustness of our approach in maintaining accuracy under analog nonidealities.
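The expert-allocation criterion described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the use of NumPy, and the convention that an expert's score is the maximum L2 norm over its weight rows (neurons) are assumptions made here for illustration; experts with the largest maximum neuron norm are treated as the most noise-sensitive and kept digital.

```python
# Hypothetical sketch of maximum-neuron-norm expert partitioning
# (illustrative only; not the authors' code).
import numpy as np

def max_neuron_norm(weight: np.ndarray) -> float:
    """Maximum L2 norm over the output neurons (rows) of an expert weight matrix."""
    return float(np.linalg.norm(weight, axis=1).max())

def partition_experts(expert_weights, num_digital: int):
    """Split expert indices into (digital_ids, analog_ids).

    Experts with the largest maximum neuron norm are assumed to be the most
    noise-sensitive, so they are assigned to digital computation; the rest
    are mapped to analog (AIMC) hardware.
    """
    scores = [max_neuron_norm(w) for w in expert_weights]
    order = sorted(range(len(expert_weights)),
                   key=lambda i: scores[i], reverse=True)
    return order[:num_digital], order[num_digital:]

# Toy usage: four experts (hidden dim 8 -> 16) with different weight scales;
# the large-scale expert should be selected for digital computation.
rng = np.random.default_rng(0)
experts = [rng.normal(scale=s, size=(16, 8)) for s in (0.1, 1.0, 0.2, 0.5)]
digital, analog = partition_experts(experts, num_digital=1)
```

In a full system, `num_digital` would be chosen to balance the digital compute/energy budget against the accuracy loss tolerated on the analog experts.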