🤖 AI Summary
This work addresses the severe degradation of expert load balancing in Mixture-of-Experts (MoE) large language models caused by hardware noise in analog in-memory computing, which undermines routing decisions. It presents the first systematic investigation into this failure mechanism and introduces ROMER, a robust framework that restores load balance through expert replacement guided by an accurate chip-level noise model. ROMER further incorporates a percentile-based logits recalibration strategy for the router, enhancing robustness without requiring model retraining. Evaluated on DeepSeek-MoE, Qwen-MoE, and OLMoE, the approach achieves substantial improvements, reducing perplexity by up to 58.6%, 58.8%, and 59.8%, respectively, demonstrating both effectiveness and architectural generality.
📝 Abstract
Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and the impact of these imperfections on MoE-based LLMs remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under a noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6%, 58.8%, and 59.8% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.
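The two calibration steps named in the abstract can be illustrated with a small, self-contained sketch. This is not the ROMER implementation: the abstract does not specify the algorithm, so every function name, the per-expert percentile band (5th to 95th), the `min_count` threshold, and the toy logits below are assumptions made purely for illustration. The sketch counts top-k expert load on a calibration batch, builds a hypothetical plan that maps underactivated experts to frequently used ones, and rescales each expert's router logit by percentiles of that expert's logits over the batch.

```python
from collections import Counter

def top_k(logits, k):
    """Indices of the k largest router logits for one token."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def expert_load(batch_logits, k):
    """Count how often each expert is selected over a batch of tokens."""
    load = Counter()
    for logits in batch_logits:
        load.update(top_k(logits, k))
    return load

def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def fit_percentile_stats(calib_logits, lo_q=5.0, hi_q=95.0):
    """Per-expert (low, high) percentiles over a calibration batch.

    Assumption: normalization is done per expert across tokens; the
    abstract only says 'percentile-based normalization'.
    """
    return [(percentile(col, lo_q), percentile(col, hi_q))
            for col in zip(*calib_logits)]

def recalibrate(logits, stats):
    """Rescale each expert's logit so its percentile band maps to [0, 1]."""
    out = []
    for x, (lo, hi) in zip(logits, stats):
        span = hi - lo
        out.append((x - lo) / span if span > 0 else 0.0)
    return out

def replacement_plan(load, num_experts, min_count):
    """Hypothetical plan: map each underactivated expert to a busy one."""
    ranked = sorted(range(num_experts), key=lambda e: load[e], reverse=True)
    hot = [e for e in ranked if load[e] >= min_count]
    cold = [e for e in ranked if load[e] < min_count]
    if not hot:
        return {}
    # Spread cold slots round-robin over the most frequently used experts.
    return {c: hot[i % len(hot)] for i, c in enumerate(cold)}

# Toy router logits (4 tokens x 4 experts) in which a noise-induced offset
# makes expert 0 win every top-1 decision. The values are made up.
noisy_logits = [
    [5.0, 1.0, 0.5, 0.2],
    [5.2, 0.4, 0.6, 0.3],
    [4.9, 0.3, 0.2, 1.2],
    [5.1, 0.9, 1.1, 0.4],
]

before = expert_load(noisy_logits, k=1)          # all four tokens go to expert 0
plan = replacement_plan(before, num_experts=4, min_count=1)
stats = fit_percentile_stats(noisy_logits)
after = expert_load([recalibrate(row, stats) for row in noisy_logits], k=1)
```

In this toy case the raw logits route every token to expert 0, while the recalibrated logits send one token to each of the four experts. Whether ROMER normalizes per expert, per token, or per layer, and how it selects replacement targets, is not stated in the abstract, so the sketch should be read only as one plausible reading of the two steps.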