DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs

📅 2026-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Mixture-of-Experts (MoE) models suffer from CPU-GPU load imbalance and suboptimal resource utilization during local PC inference due to static expert assignment and inefficient prefetching and caching strategies. To address these limitations, this work proposes DALI, a novel framework that, for the first time, formulates expert assignment as a 0-1 integer optimization problem solved in real time. DALI dynamically allocates experts by predicting high-load experts using residual information and designs a workload-aware cache replacement policy that exploits temporal correlations in expert activation patterns. Furthermore, it orchestrates computation across CPU and GPU resources in a coordinated manner. Experimental results demonstrate that DALI significantly accelerates both the prefill and decoding stages across various MoE models, outperforming state-of-the-art offloading frameworks in inference efficiency.
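The summary's core idea, casting expert placement as a 0-1 assignment solved greedily at runtime, can be illustrated with a small sketch. This is not the paper's implementation; the cost model (a single `gpu_speedup` ratio), the capacity limit, and the predicted per-expert loads are all simplifying assumptions made here for illustration.

```python
# Hypothetical sketch: assign each expert to GPU or CPU so the estimated
# per-layer latency (the max of CPU-side and GPU-side work) shrinks,
# subject to a GPU memory budget. The paper formulates this as a 0-1
# integer program; a greedy heuristic is one cheap way to solve it online.

def greedy_assign(expert_loads, gpu_speedup=4.0, gpu_capacity=8):
    """expert_loads: predicted tokens routed to each expert this step.
    gpu_speedup: assumed ratio of CPU to GPU per-token cost (illustrative).
    gpu_capacity: max number of experts that fit in GPU memory.
    Returns the set of expert indices placed on the GPU."""
    # Consider the heaviest experts first: moving them yields the
    # largest reduction in CPU-side work per unit of GPU memory.
    order = sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e])
    gpu, cpu_time, gpu_time = set(), float(sum(expert_loads)), 0.0
    for e in order:
        if len(gpu) >= gpu_capacity:
            break
        load = expert_loads[e]
        # Moving expert e: CPU loses `load`, GPU gains `load / gpu_speedup`.
        new_cpu, new_gpu = cpu_time - load, gpu_time + load / gpu_speedup
        if max(new_cpu, new_gpu) < max(cpu_time, gpu_time):
            gpu.add(e)
            cpu_time, gpu_time = new_cpu, new_gpu
        else:
            break  # further moves only worsen the CPU-GPU balance
    return gpu
```

With one hot expert and several cold ones, the heuristic offloads only as many experts as actually improves the bottleneck, rather than statically pinning a fixed set to the GPU.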

📝 Abstract
Mixture of Experts (MoE) architectures significantly enhance the capacity of LLMs without proportional increases in computation, but at the cost of a vast parameter size. Offloading MoE expert parameters to host memory and leveraging both CPU and GPU computation has recently emerged as a promising direction to support such models on resource-constrained local PC platforms. While promising, we notice that existing approaches fail to match the dynamic nature of expert workloads, which leads to three fundamental inefficiencies: (1) Static expert assignment causes severe CPU-GPU load imbalance, underutilizing CPU and GPU resources; (2) Existing prefetching techniques fail to accurately predict high-workload experts, leading to costly inaccurate prefetches; (3) GPU cache policies neglect workload dynamics, resulting in poor hit rates and limited effectiveness. To address these challenges, we propose DALI, a workloaD-Aware offLoadIng framework for efficient MoE inference on local PCs. To fully utilize hardware resources, DALI first dynamically assigns experts to CPU or GPU by modeling assignment as a 0-1 integer optimization problem and solving it efficiently using a Greedy Assignment strategy at runtime. To improve prefetching accuracy, we develop a Residual-Based Prefetching method leveraging inter-layer residual information to accurately predict high-workload experts. Additionally, we introduce a Workload-Aware Cache Replacement policy that exploits temporal correlation in expert activations to improve GPU cache efficiency. Evaluated across various MoE models and settings, DALI achieves significant speedups in both the prefill and decoding phases over state-of-the-art offloading frameworks.
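The Workload-Aware Cache Replacement idea described in the abstract, exploiting temporal correlation in expert activations rather than plain recency, can be sketched as follows. This is an illustrative stand-in, not the paper's exact policy: the exponentially decayed activation count and the `decay` parameter are assumptions made here to capture "recently hot experts stay cached."

```python
from collections import defaultdict

class WorkloadAwareCache:
    """Illustrative sketch: keep experts whose recent activation
    frequency is high, using an exponentially decayed hit count to
    model temporal correlation in expert routing."""

    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity
        self.decay = decay               # how fast old activations fade
        self.cached = set()              # expert ids resident on GPU
        self.score = defaultdict(float)  # decayed activation counts

    def access(self, expert):
        """Record an activation; return True on cache hit, False on miss."""
        # Decay every expert's score, then credit the activated one.
        for e in list(self.score):
            self.score[e] *= self.decay
        self.score[expert] += 1.0
        if expert in self.cached:
            return True                  # hit: expert already on GPU
        if len(self.cached) >= self.capacity:
            # Evict the resident expert with the coldest recent workload,
            # not merely the least-recently-used one.
            victim = min(self.cached, key=lambda e: self.score[e])
            self.cached.remove(victim)
        self.cached.add(expert)
        return False                     # miss: expert fetched from host
```

Unlike LRU, an expert activated many times in a short burst keeps a high score for a while, so a single access to a cold expert cannot immediately evict it.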
Problem

Research questions and friction points this paper is trying to address.

Mixture of Experts
workload dynamics
offloading
CPU-GPU load imbalance
cache efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts
workload-aware offloading
dynamic expert assignment
residual-based prefetching
cache replacement policy