🤖 AI Summary
Deploying 70B-scale large language models (LLMs) efficiently on resource-constrained home devices—characterized by limited memory/VRAM, Wi-Fi interconnects, and CPU/GPU heterogeneity—remains highly challenging.
Method: We propose prima.cpp, a distributed inference system featuring: (i) piped-ring parallelism combined with prefetching to hide disk loading latency; (ii) Halda, an algorithm that solves the NP-hard problem of assigning model layers to each device's CPU and GPU on heterogeneous hardware; and (iii) mmap-based weight management, lightweight distributed scheduling, and heterogeneity-aware system modeling that covers computation, communication, disk, memory, and OS behavior.
Results: On a common four-node home cluster, prima.cpp runs 30B-70B models, including Llama 3 and DeepSeek R1, while keeping memory pressure below 6%, and it outperforms llama.cpp, exo, and dllama on 30B+ models. This work makes frontier 30B-70B LLMs practical to deploy in personal computing environments.
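As a rough illustration of why mmap-based weight management keeps memory pressure low, the sketch below maps a toy weight file and reads a single layer from it. The file layout, the helper names (`write_dummy_weights`, `read_layer`), and the sizes are assumptions made for this example and are not taken from prima.cpp; the point is only that a mapped file is paged in on access, so untouched layers need not occupy RAM.

```python
import mmap
import os
import struct
import tempfile


def write_dummy_weights(path, n_layers, floats_per_layer):
    """Write a toy weight file: n_layers contiguous blocks of float32 values.

    Every value in layer k is set to float(k) so reads are easy to check.
    """
    with open(path, "wb") as f:
        for layer in range(n_layers):
            values = [float(layer)] * floats_per_layer
            f.write(struct.pack(f"<{floats_per_layer}f", *values))


def read_layer(mm, layer, floats_per_layer):
    """Read one layer from the mapping.

    Slicing the mapping touches only the pages that back this byte range.
    """
    nbytes = floats_per_layer * 4  # float32 is 4 bytes
    start = layer * nbytes
    return struct.unpack(f"<{floats_per_layer}f", mm[start:start + nbytes])


# Hypothetical toy file: 8 layers of 1024 float32 values each.
path = os.path.join(tempfile.mkdtemp(), "toy_weights.bin")
write_dummy_weights(path, n_layers=8, floats_per_layer=1024)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    layer3 = read_layer(mm, 3, 1024)
    mm.close()
```

In a real system the operating system decides when to evict mapped pages, which is presumably why the summary lists memory and OS behavior among the things being modeled.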
📝 Abstract
The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing on-device solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPUs and GPUs, low RAM/VRAM, and Wi-Fi, with cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.
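To make the layer assignment trade-off concrete, here is a deliberately simplified sketch. It is not Halda and does not reproduce the paper's system model. It assumes a toy cost model in which each device has a fixed per-layer compute time, can hold a fixed number of layers in memory, and pays a fixed disk-loading penalty for every layer beyond that. All names and numbers (`assign_layers`, `compute_ms`, `mem_layers`, `disk_ms`) are hypothetical. Under this toy model each device's marginal cost never decreases as it takes more layers, so handing each layer to the currently cheapest device minimizes the summed per-token latency; the real problem described in the abstract is NP-hard and needs more than this.

```python
def assign_layers(n_layers, compute_ms, mem_layers, disk_ms):
    """Greedy layer allocation under a toy latency model (illustration only).

    compute_ms[i]: per-layer compute time on device i
    mem_layers[i]: number of layers device i can keep in memory
    disk_ms[i]:    extra per-layer cost once device i exceeds its memory
    Returns (layers per device, total per-token latency in ms).
    """
    counts = [0] * len(compute_ms)
    total = 0.0

    def marginal(i):
        # Cost of placing one more layer on device i given its current load.
        overflow = disk_ms[i] if counts[i] >= mem_layers[i] else 0.0
        return compute_ms[i] + overflow

    for _ in range(n_layers):
        best = min(range(len(counts)), key=marginal)
        total += marginal(best)
        counts[best] += 1
    return counts, total


# Hypothetical four-device cluster, fastest device first.
counts, latency_ms = assign_layers(
    n_layers=80,
    compute_ms=[2.0, 5.0, 8.0, 10.0],
    mem_layers=[30, 20, 20, 10],
    disk_ms=[40.0, 40.0, 40.0, 40.0],
)
print(counts, latency_ms)  # [30, 20, 20, 10] 420.0
```

With these numbers the disk penalty is large enough that every device is filled exactly to its memory capacity before any layer spills to disk, which mirrors the abstract's emphasis on hiding or avoiding disk loading.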