ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the severe degradation of expert load balancing in Mixture-of-Experts (MoE) large language models caused by hardware noise in analog in-memory computing, which undermines routing decisions. It presents the first systematic investigation into this failure mechanism and introduces ROMER, a robust framework that restores load balance through expert replacement guided by an accurate chip-level noise model. ROMER further incorporates a percentile-based logits recalibration strategy for the router, enhancing robustness without requiring model retraining. Evaluated on DeepSeek-MoE, Qwen-MoE, and OLMoE, the approach achieves substantial improvements, reducing perplexity by up to 58.6%, 58.8%, and 59.8%, respectively, demonstrating both effectiveness and architectural generality.
📝 Abstract
Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and the negative impact of these imperfections on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under a noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6%, 58.8%, and 59.8% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.
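The two calibration steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the selection criterion or the normalization formula, so everything specific here is an assumption made for illustration. That includes the function names, the `low_fraction` threshold for "underactivated", the choice of the 10th/50th/90th percentiles, and the decision to gather percentile statistics per expert over a calibration batch.

```python
from collections import Counter


def activation_counts(routed_expert_ids, num_experts):
    """Count how often each expert was selected on a calibration set."""
    counts = Counter(routed_expert_ids)
    return [counts.get(e, 0) for e in range(num_experts)]


def replacement_plan(counts, low_fraction=0.25):
    """Map each underactivated expert to a high-frequency donor expert.

    Assumption: an expert is "underactivated" when its count is below
    low_fraction of the uniform share. The paper's criterion may differ.
    """
    threshold = low_fraction * sum(counts) / len(counts)
    ranked = sorted(range(len(counts)), key=lambda e: counts[e], reverse=True)
    under = [e for e in ranked[::-1] if counts[e] < threshold]
    donors = [e for e in ranked if counts[e] >= threshold]
    # Least-used experts are paired with the most-used donors, cycling if needed.
    return {e: donors[i % len(donors)] for i, e in enumerate(under)}


def percentile(values, q):
    """Linearly interpolated percentile of a non-empty list, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fit_percentile_stats(calib_logits, lo_q=10.0, hi_q=90.0):
    """Per-expert (median, inter-percentile spread) from calibration logits.

    calib_logits is a list of per-token logit lists: [tokens][experts].
    Assumption: statistics are collected per expert under the noisy model.
    """
    stats = []
    for e in range(len(calib_logits[0])):
        column = [row[e] for row in calib_logits]
        center = percentile(column, 50.0)
        spread = percentile(column, hi_q) - percentile(column, lo_q)
        stats.append((center, spread))
    return stats


def recalibrate(logits, stats, eps=1e-8):
    """Center and scale one token's router logits with the fitted statistics."""
    return [(z - c) / (s + eps) for z, (c, s) in zip(logits, stats)]


def top_k(logits, k):
    """Indices of the k largest logits, i.e. the experts the router selects."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]


# Toy usage: expert 3 is never routed to, so it is mapped to donor expert 0.
counts = activation_counts([0, 0, 0, 0, 1, 1, 1, 2], num_experts=4)
plan = replacement_plan(counts)

# Toy usage: expert 1 has a systematically larger logit offset; after
# per-expert recalibration the same token is routed to expert 0 instead.
stats = fit_percentile_stats([[0.0, 10.0], [1.0, 12.0], [2.0, 14.0]])
raw_choice = top_k([2.0, 12.0], k=1)
calibrated_choice = top_k(recalibrate([2.0, 12.0], stats), k=1)
```

In this sketch the replacement plan only records which expert would be overwritten by which donor; copying the donor's weights into the replaced slot is left to the surrounding model code. Percentiles are used rather than mean and standard deviation on the assumption that they are less sensitive to the outlier logits that weight noise can produce.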
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Compute-in-Memory
Hardware Noise
Load Balance
Routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Compute-in-Memory
Analog Noise
Router Calibration
Expert Replacement
Wenyong Zhou
The University of Hong Kong
Computer Vision
Yuannuo Feng
The School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China
Yizhe Chen
The School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China
Taiqiang Wu
University of Hong Kong | Tsinghua University
Model Compression, Efficient Methods
Wendong Xu
The Department of Electrical and Computer Engineering, The University of Hong Kong, Hong Kong
Wenbo Qi
Nanyang Technological University
Zhengwu Liu
The University of Hong Kong (HKU) / Tsinghua University (THU)
brain machine interfaces, computing in memory, memristor
Wang Kang
Beihang University
Spintronics, Nonvolatile Memory and Logic Circuits, Non-Von Neumann Computing Architectures
Ngai Wong
The Department of Electrical and Computer Engineering, The University of Hong Kong, Hong Kong