🤖 AI Summary
To address the deployment challenge of Mixture-of-Experts (MoE) large language models on consumer-grade GPUs with limited memory (e.g., 12 GB), this paper proposes a CPU-GPU collaborative inference framework. Our method introduces three key innovations: (1) a novel GPU-resident expert caching mechanism that dynamically retains frequently accessed experts; (2) a CPU multi-threaded cache-miss handling and heterogeneous task scheduling strategy, enabling low-overhead coordination between computation and weight migration; and (3) a lightweight weight migration protocol that significantly reduces data transfer latency. Experimental results demonstrate that, under single-request settings, our framework achieves a 43% reduction in end-to-end latency and a 2.3× improvement in inference throughput. Notably, it enables real-time execution of a 7B-MoE model on a 12-GB GPU for the first time.
📝 Abstract
Large Language Models (LLMs) have achieved impressive results across various tasks, yet their high computational demands pose deployment challenges, especially on consumer-grade hardware. Mixture-of-Experts (MoE) models offer an efficient alternative by selectively activating subsets of parameters, which reduces computation requirements. Despite this efficiency, state-of-the-art MoE models still require memory well beyond typical consumer GPU capacities. Traditional offloading methods, which transfer model weights between CPU and GPU on demand, introduce latency that limits inference performance. This paper presents a novel CPU-GPU collaborative inference framework that maintains an expert cache on the GPU to reduce data transfers and accelerate inference through cache hits; on a cache miss, computation is instead offloaded to the CPU, where multithreading optimizations keep miss handling efficient. Our evaluations demonstrate performance improvements and highlight the potential of CPU-GPU collaboration to maximize hardware utilization for single-request inference on consumer-grade systems. The implementation of our framework is available at https://github.com/elsa-lab/MoE-CPU-GPU-Collaborative-Inference.
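The caching behavior described above can be sketched as a small LRU cache of expert weights with a CPU fallback on a miss. This is an illustrative assumption, not the authors' implementation: the class name, the LRU eviction policy, and the scalar "weights" are stand-ins for the paper's GPU-resident cache, migration protocol, and multi-threaded CPU compute path.

```python
# Hypothetical sketch of a GPU-resident expert cache with CPU fallback.
# Not the paper's implementation; names and eviction policy are assumptions.
from collections import OrderedDict

class ExpertCache:
    """LRU cache of MoE expert weights notionally resident on the GPU.

    `capacity` models how many experts fit in GPU memory; `load_expert`
    stands in for fetching an expert's weights from CPU memory.
    """
    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert   # callable: expert_id -> weights
        self.cache = OrderedDict()       # expert_id -> weights ("on GPU")
        self.hits = 0
        self.misses = 0

    def run_expert(self, expert_id, x, gpu_compute, cpu_compute):
        if expert_id in self.cache:
            # Cache hit: run the expert on the GPU with resident weights.
            self.hits += 1
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return gpu_compute(self.cache[expert_id], x)
        # Cache miss: run this expert on the CPU (multi-threaded in the
        # paper's design) while its weights are migrated into the cache.
        self.misses += 1
        weights = self.load_expert(expert_id)
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)     # evict least-recently-used
        self.cache[expert_id] = weights
        return cpu_compute(weights, x)

# Toy usage: experts are scalar multipliers "loaded" from CPU memory.
cpu_experts = {0: 2.0, 1: 3.0, 2: 5.0}
cache = ExpertCache(capacity=2, load_expert=cpu_experts.__getitem__)
compute = lambda w, x: w * x  # stands in for an expert's forward pass
outs = [cache.run_expert(e, 1.0, compute, compute) for e in (0, 1, 0, 2, 1)]
```

With the access pattern above, only the second visit to expert 0 hits the cache; the eviction on expert 2 pushes expert 1 out, so its second visit misses again, illustrating why keeping frequently accessed experts resident matters.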