Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model

📅 2025-06-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Deploying private large language models (LLMs) in personal and small-scale settings faces prohibitive costs and scalability bottlenecks. Method: This paper proposes a multi-node expert-parallel inference framework tailored for Apple Silicon, deploying the MoE-based DBRX model on an M2 Ultra Mac Studio cluster. It identifies network latency as a critical bottleneck in distributed MoE inference, designs memory-management optimizations to eliminate redundant overheads in Apple's software stack, and develops a lightweight performance-modeling methodology for predicting inference latency and throughput across configurations. Contribution/Results: Experimental evaluation shows the proposed solution is 1.15 times more cost-efficient than an NVIDIA H100 supercomputing system while significantly improving end-to-end inference speed, establishing a high-performance, low-cost, and predictable paradigm for private LLM deployment native to the Apple ecosystem.

πŸ“ Abstract
Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small-group services, as envisioned by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that the computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the memory-management logic in Apple's software stack. Based on these findings, we develop optimization schemes to eliminate the memory-management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than a state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model to estimate system performance under varying configurations, and the model provides valuable insights for designing private LLM systems.
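The abstract's key observation — that per-layer expert compute is comparable to the communication time for exchanging expert outputs, so latency matters more than bandwidth — can be illustrated with a minimal latency/throughput sketch. This is not the paper's actual performance model; the function names and all numeric constants below are illustrative assumptions.

```python
# Illustrative sketch of a per-layer latency model for multi-node
# expert-parallel MoE decoding. Per-layer time is expert compute plus
# an exchange cost of (latency + payload / bandwidth); for the small
# per-token activations exchanged during decoding, the latency term
# dominates the bandwidth term, matching the paper's observation.
# All names and constants are assumptions, not values from the paper.

def layer_decode_time(expert_compute_s: float,
                      net_latency_s: float,
                      payload_bytes: int,
                      bandwidth_bps: float) -> float:
    """Time for one MoE layer: local expert compute, then the
    output exchange modeled as latency + payload / bandwidth."""
    comm_s = net_latency_s + payload_bytes / bandwidth_bps
    return expert_compute_s + comm_s

def tokens_per_second(num_layers: int, **layer_kwargs) -> float:
    """Decode throughput implied by summing the per-layer times."""
    return 1.0 / (num_layers * layer_decode_time(**layer_kwargs))

if __name__ == "__main__":
    # Hypothetical numbers: 40 MoE layers, 1 ms expert compute per
    # layer, 0.5 ms round-trip latency, 24 KiB of activations over
    # a ~10 Gb/s link (1.25e9 bytes/s).
    tps = tokens_per_second(
        num_layers=40,
        expert_compute_s=1e-3,
        net_latency_s=0.5e-3,
        payload_bytes=24 * 1024,
        bandwidth_bps=1.25e9,
    )
    print(f"{tps:.2f} tokens/s")
```

With these assumed numbers, the bandwidth term (~0.02 ms) is an order of magnitude smaller than the latency term (0.5 ms), which is why a model of this shape predicts that reducing round-trip latency, not adding bandwidth, speeds up distributed MoE decoding.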
Problem

Research questions and friction points this paper is trying to address.

Addressing cost and scalability in private LLM systems
Optimizing multi-node expert parallelism on Apple Silicon
Reducing inference time and management overhead in MoE models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-node expert parallelism on Apple Silicon
Optimized memory management for efficiency
Performance modeling for private LLM systems
Mu-Chi Chen
Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan; Also with Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Po-Hsuan Huang
Dept. of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan; Also with Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; Also with Dept. of Computer Science and Information Engineering, National Taiwan University
Xiangrui Ke
Dept. of Computer Science, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Chia-Heng Tu
National Cheng Kung University
Heterogeneous parallel computing; Embedded systems design and optimization; Compiler design
Chun Jason Xue
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Systems and Storage
Shih-Hao Hung
National Taiwan University
Computer Architecture; Parallel Computing; Performance; Virtualization; GPU