🤖 AI Summary
Existing private LLM inference frameworks support only dense models and cannot securely scale to Mixture-of-Experts (MoE) architectures, because the dynamic routing mechanism can reveal sensitive input information if it is not fully protected.
Method: This paper presents CryptoMoE, the first cryptographically secure inference framework for MoE models. Built on secure multi-party computation (MPC), it introduces a load-balanced secure routing protocol, privacy-preserving expert dispatch and combine mechanisms, a confidence-aware token selection strategy, and a batched matrix multiplication protocol that together reduce computational and communication overhead.
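One way to make expert routing data-independent, as the load-balancing idea above suggests, is to cap every expert at a fixed token capacity so the dispatch table always has the same shape regardless of the input. The sketch below is an illustrative plaintext analogue only (the actual protocol runs under MPC, and the function name and overflow policy here are assumptions, not the paper's protocol):

```python
import numpy as np

def balanced_dispatch(scores: np.ndarray, capacity: int) -> np.ndarray:
    """Assign each token to its highest-scoring expert, but cap every
    expert at a fixed `capacity` so the per-expert load is identical
    for every input. Overflow tokens fall back to their next-best
    expert; unfilled slots are padded with a dummy index (-1)."""
    n_tokens, n_experts = scores.shape
    order = np.argsort(-scores, axis=1)           # experts ranked per token
    slots = np.full((n_experts, capacity), -1)    # fixed-shape dispatch table
    fill = np.zeros(n_experts, dtype=int)
    for t in range(n_tokens):
        for e in order[t]:
            if fill[e] < capacity:
                slots[e, fill[e]] = t
                fill[e] += 1
                break
    return slots
```

Because `slots` always has shape `(n_experts, capacity)`, an observer of the (encrypted) dispatch pattern learns nothing about which experts the input actually preferred.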
Contribution/Results: Evaluated on DeepSeekMoE-16.4B, OLMoE-6.9B, and QWenMoE-14.3B, CryptoMoE achieves 2.8–3.5× lower end-to-end latency and 2.9–4.3× less communication than a dense-model private-inference baseline, with minimal accuracy loss. The authors also adapt CipherPrune (ICLR'25), an adaptive pruning technique, to MoE inference and show communication reductions of up to 4.3×.
📝 Abstract
Private large language model (LLM) inference based on cryptographic primitives offers a promising path towards privacy-preserving deep learning. However, existing frameworks only support dense LLMs like LLaMA-1 and struggle to scale to mixture-of-experts (MoE) architectures. The key challenge comes from securely evaluating the dynamic routing mechanism in MoE layers, which may reveal sensitive input information if not fully protected. In this paper, we propose CryptoMoE, the first framework that enables private, efficient, and accurate inference for MoE-based models. CryptoMoE balances expert loads to protect expert routing information and proposes novel protocols for secure expert dispatch and combine. CryptoMoE also develops a confidence-aware token selection strategy and a batch matrix multiplication protocol to further improve accuracy and efficiency. Extensive experiments on DeepSeekMoE-16.4B, OLMoE-6.9B, and QWenMoE-14.3B show that CryptoMoE achieves $2.8\sim3.5\times$ end-to-end latency reduction and $2.9\sim4.3\times$ communication reduction over a dense baseline with minimal accuracy loss. We also adapt CipherPrune (ICLR'25) for MoE inference and demonstrate CryptoMoE can reduce the communication by up to $4.3\times$. Code is available at: https://github.com/PKU-SEC-Lab/CryptoMoE.
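The batch matrix multiplication protocol mentioned in the abstract exploits the fact that, once every expert processes a fixed-size token batch, all per-expert feed-forward multiplications can be issued as one batched GEMM instead of many small ones, which amortizes per-call overhead (and, under MPC, communication rounds). The plaintext sketch below only illustrates the batching shape; the dimensions are hypothetical and the secure protocol itself is not shown:

```python
import numpy as np

# Hypothetical shapes: E experts, each holding a fixed-capacity batch of
# C token embeddings of dimension d, and its own d x h weight matrix.
E, C, d, h = 4, 8, 16, 32
rng = np.random.default_rng(0)
tokens = rng.standard_normal((E, C, d))    # per-expert padded token batches
weights = rng.standard_normal((E, d, h))   # one weight matrix per expert

# One batched multiply replaces E separate matmul calls.
out = np.matmul(tokens, weights)           # shape (E, C, h)
```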